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Abstract 

This report presents a thorough convergence analysis of Kronecker graphical lasso (KGLasso) algo- 
rithms for estimating the covariance of an i.i.d. Gaussian random sample under a sparse Kronecker-product 
covariance model. The KGlasso model, originally called the transposable regularized covariance model 
by Allen et al HI, implements a pair of i\ penalties on each Kronecker factor to enforce sparsity in 
the covariance estimator. The KGlasso algorithm generalizes Glasso, introduced by Yuan and Lin [2| 
and Banerjee et al 0, to estimate covariances having Kronecker product form. It also generalizes the 
unpenalized ML flip-flop (FF) algorithm of Dutilleul |4| and Werner et al [5| to estimation of sparse 
Kronecker factors. We establish that the KGlasso iterates converge pointwise to a local maximum of the 
penalized likelihood function. We derive high dimensional rates of convergence to the true covariance 
as both the number of samples and the number of variables go to infinity. Our results establish that 
KGlasso has significantly faster asymptotic convergence than FF and Glasso. Our results establish that 
KGlasso has significantly faster asymptotic convergence than FF and Glasso. Simulations are presented 
that validate the results of our analysis. For example, for a sparse 10,000 x 10,000 covariance matrix 
equal to the Kronecker product of two 100 x 100 matrices, the root mean squared error of the inverse 
covariance estimate using FF is 3.5 times larger than that obtainable using KGlasso. 

Index Terms 

Sparsity, structured covariance estimation, penalized maximum likelihood, graphical lasso, direct 
product representation. 

I. Introduction 

Covariance estimation is a problem of great interest in many different disciplines, including machine 
learning, signal processing, economics and bioinformatics. In many applications the number of variables 
is very large, e.g., in the tens or hundreds of thousands, leading to a number of covariance parameters that 
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greatly exceeds the number of observations. To address this problem constraints are frequently imposed 
on the covariance to reduce the number of parameters in the model. For example, the Glasso model of 
Yuan and Lin |2J and Banerjee et al Q imposes sparsity constraints on the covariance. The Kronecker 
product model of Dutilleul J4[ and Werner et al Q assumes that the covariance can be represented 
as the Kronecker product of two lower dimensional covariance matrices. The transposable regularized 
covariance model of Allen et al |Q0] imposes a combination of sparsity and Kronecker product form on the 
covariance. When there is no missing data, an extension of the alternating optimization algorithm of [4], 
|[5l , called the flip flop (FF) algorithm, can be applied to estimate the parameters of this combined sparse 
and Kronecker product model. In this report we call this algorithm the Kronecker Glasso (KGlasso) and 
we thoroughly analyze convergence of the algorithm in the high dimensional setting. 

As in J5] we assume that there are pf variables whose covariance So na s the separable positive definite 
Kronecker product representation: 

S = A ® B (1) 

where Ao is a p x p positive definite matrix and Bo is an / x / positive definite matrix. This model ([TJ 
is relevant to channel modeling for MIMO wireless communications, where Ao is a transmit covariance 
matrix and Bo is a receive covariance matrix |6|. The model is also relevant to other transposable models 
arising in recommendation systems like NetFlix and in gene expression analysis HI. 

The Kronecker product Gaussian graphical model has been known for a long time as the matrix 
normal distribution in the statistics community Q, flU, JSJ. Various properties of the matrix variate 
normal distribution have been studied in JS). Let us rewrite the problem into matrix form. Consider a 
p x / random matrix Z that follows a matrix normal distribution-i.e. Z ~ N p j(0; Aq, Bo) flU. Then, 
Bo is the row covariance matrix and Ao is the column covariance matrix-i.e., Z : fc ~ N(0, [Bo]fc,fcAo) 
and Zj >: ~ N(0, [Ao]i,jBo) Q This model further finds applications in geostatistics [9] and genomics 
ifTOl . Further applications of matrix- variate normal models include collaborative filtering |fTT|. multi-task 
learning [12] and face recognition (13]. The Kronecker factorization ([!]) can easily be generalized to the 
/c-fold case, where So = Ai ® A2 ® • • • ® A&. 

Under the assumption that the measurements are multivariate Gaussian with covariance having the 
Kronecker product form ([I]), the maximum likelihood (ML) estimator can be formulated lfl4l . While the 
ML estimator has no known closed-form solution, an approximation to the solution can be iteratively 

'Here, Z», : is the ith row and Z : ,fe is the fcth column of the matrix Z. For concreteness, assume z = [zf , . . . ,Zp] T ~ 
iV(0, So). Then, Z = [zi, . . . , z p ] T is the p x / data matrix with row covariance Bo and column covariance Aq. 



February 13, 2013 



DRAFT 



3 



computed via an alternating algorithm: the flip-flop (FF) algorithm lfl4ll . J5j. As compared to the standard 
saturated (unstructured) covariance model, the number of unknown parameters in ([TJ is reduced from 
order Q(p 2 f 2 ) to order Q{p 2 ) + ©(/ 2 ). This results in a significant reduction in the mean squared 
error (MSE) and the computational complexity of the maximum likelihood (ML) covariance estimator. 
This report establishes that further reductions MSE are achievable when the Kronecker matrix factors 
are known to have sparse inverses, i.e., the measurements obey a sparse Kronecker structured Gaussian 
graphical model. 

The graphical lasso (Glasso) estimator was originally proposed in (2|, (3J for estimating a sparse inverse 
covariance, also called the precision matrix, under an i.i.d. Gaussian observation model. An algorithm 
for efficiently solving the nonsmooth optimization problem that arises in the Glasso estimator, based on 
ideas from (3j , was proposed in ||T51 . Glasso has been applied to the time- varying coefficients setting in 
Zhou et al lfT6l using the kernel estimator for covariances at a target time. Rothman et al [fl/7 ] derived 
high dimensional convergence rates for a slight variant of Glasso, i.e., only the off-diagonal entries of 
the estimated precision matrix were penalized using an l\ -penalty. The high dimensional convergence 
rate of Glasso was established by Ravikumar et al lTT8l . This report extends their analysis to the case 
that the covariance has Kronecker structure ([I]), showing that significantly higher rates of convergence 
are achievable. 

The main contribution is the derivation of the high-dimensional MSE convergence rates for KGlasso as 
n, p and / go to infinity. When both Kronecker factors are sparse, it is shown that KGlasso strictly out- 
performs FF and Glasso in terms of MSE convergence rate. More specifically, we show KGlasso achieves 
a convergence rate of Op ^ (p + ^ 1o § ™ ax (P'/. ra ) ^ an( i pp ac hieves a rate of Op ^ — + ^ ) lo ^ max (p>/' n ) ^ as 
n — y oo, while it is known ifTTl . ifTol that Glasso achieves a rate of Op ^ ^°gmax(p ,/,n) ^ ^ w h ere s 
denotes the number of off-diagonal nonzero elements in the true precision matrix @o- Simulations show 
that the performance improvements predicted by the high-dimensional analysis continue to hold for small 



sample size and moderate matrix dimension. For the example studied in Sec. VIII the empirical MSE of 
KGlasso is significantly lower than that of Glasso and FF for p = f = 100 over the range of n from 10 
to 100. 

The starting point for the MSE convergence analysis is the large-sample analysis of the FF algorithm 
(Thm. 1 in Q). The KGlasso convergence proof uses a large deviation inequality that shows that the 
dimension of one estimated Kronecker factor, say A, acts as a multiplier on the number of independent 
samples when performing inference on the other factor B. This result is then used to obtain optimal MSE 
rates in terms of Frobenius norm error between the KGlasso estimated matrix and the ground truth. The 
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asymptotic MSE convergence analysis is useful since it can be used to guide the selection of sparsity 
regularization parameters and to determine minimum sample size requirements. 

An anonymous reviewer alerted the authors to the related work of Yin and Li |[T0l . published after 
submission of this paper for publication. Yin and Li obtain high-dimensional MSE bounds for the same 
matrix normal estimation problem considered here. However, our MSE bounds are tighter than the bounds 
given in Yin and Li. In particular, neglecting terms of order log(p/), our bounds are of order p + / as 
compared to Yin and Li's bounds of order pf, which is significantly weaker for large p, f. We obtain 
improved bounds due to the use of a tighter concentration inequality, established in Lemma [5] 



A. Outline 

The outline of the report is as follows. Section III] introduces the notation that will be used throughout 



the report. In Section |ITTJ the graphical lasso framework is introduced. Section [TV] uses this framework 
to introduce the KGlasso algorithm. Section [V] shows convergence of KGlasso and characterizes its limit 
points. The high dimensional MSE convergence rate derivation for the FF algorithm is included in Section 



VI Section VII| presents a high-dimensional MSE rate result that is used to establish the superiority of 



KGlasso as compared to FF and standard Glasso, under the sparse Kronecker product representation ([TJ. 



Section VIII presents simulations that empirically validate the theoretical convergence rates obtained in 
Section |W] 



II. Notation 

For a square matrix M, define |M|i = ||vec(M)|| 1 and |M|oo = HvecfM)^, where vec(M) denotes 
the vectorized form of M (concatenation of columns into a vector). ||M|| 2 is the spectral norm of 
M. Mj j and [M]jj are the (i,j)th element of M. Let the inverse transformation (from a vector to a 
matrix) be defined as: vec _1 (x) = X, where x = vec(X). Define the pf x pf permutation operator 
Kp / such that K p jvec(N) = vec(N T ) for any p x / matrix N. For a symmetric matrix M, A(M) 
will denote the vector of real eigenvalues of M and define X max (M.) = ||M|| 2 = maxAj(M) for p.d. 
symmetric matrix, and A m j n (M) = minAj(M). Define the sparsity parameter associated with M as 
sm = card({(ii, 12) : [M]^^ ^ 0,i\ ^= 12}). Let k(M) := ^""'(m) denote the condition number of a 
symmetric matrix M. 

For a matrix M of size pf x pf, let {M(i, J)}f J= i denote its / x / block submatrices, where each 
block submatrix is M(i, j) = Wi\(i-i)f+i-.if,(j^i)f+i:jf- Also let {M.(k,l)}j : l=1 denote the p x p block 
submatrices of the permuted matrix M = Kj^MK^j. 
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Define the set of symmetric matrices S p = {A £ : A = A T }, the set of symmetric positive 

semidefinite (psd) matrices S' = {A £ W xp : A = A T , z T Az > 0, Vz G W}, and the set of symmetric 
positive definite (pd) matrices S+ + = {A G W xp : A = A T , z T Az > 0, Vz / 0}. Id is a d x d identity 
matrix. It can be shown that S+ + is a convex set, but is not closed |[T9ll . Note that is simply the 
interior of the closed convex cone Sj\ 

Statistical convergence rates will be denoted by the Op(-) notation, which is defined as follows. 
Consider a sequence of real random variables {X n } ne ^ defined on a probability space (fi, J 7 , P) and a 
deterministic (positive) sequence of reals {6„} ng ^. By X n = Op(l) is meant: sup ngN Pr(|X n | > K) — > 
as K — > oo. The notation X n = Op(b n ) is equivalent to ^ = Op(l). By X n = o p (l) is meant 
Pr(|X„| > e) — > as n — > oo for any e > 0. By A n x b n is meant c\ < ^ < C2 for all n, where 
c\,C2 > are absolute constants. 

III. Graphical Lasso Framework 

For simplicity, we assume the number of Kronecker components is k = 2. Available are n i.i.d. 
multivariate Gaussian observations {zt}" =1 , where zj E W°f, having zero-mean and covariance equal to 
5] = Ao <£> Bo- Then, the log-likelihood is proportional to: 

Z(£) := logdet(I]- 1 ) - trCE" 1 ^), (2) 

where S is the positive definite covariance matrix and S n = - Ylt=i z t z T i s tne sample covariance matrix. 
Recent work 0, lTT3Tl has considered l\ -penalized maximum likelihood estimators for the saturated model 
where 5] belongs to the unrestricted cone of positive definite matrices. These estimators are known as 
graphical lasso (Glasso) estimators and are the solution to the i\ -penalized minimization problem: 

S n G arg min {-/(E) + A|I]- 1 |i}, (3) 

lies 1 ' 



where A > is a regularization parameter. If A > and S n is positive definite, then S„ in (|3j is the 
unique minimizer. 

A fast iterative algorithm, based on a block coordinate descent approach, exhibiting a computational 
complexity C((p/) 4 ), was developed in ITT31 to solve the convex program (01). Under the assumption 



Ax J l ^Ell solution of (3 1 was shown to have high dimensional convergence rate [17]: 



|G(S to ,A)-0 o || f = Op U <Pf + °)**(Pf) \ (4) 
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where s is an upper bound on the number of non-zero off-diagonal elements of ©o- When s = 0(pf), 
this rate is better than the non-regularized sample covariance estimator: 



M\f = Op\\I—\. (5) 



n 



IV. Kronecker Graphical Lasso 



Let So := Ao <8> Bo denote the true covariance matrix, where Ao := Xq 1 and Bo = Yq" 1 are the true 
Kronecker factors. Let Aj n ^ denote the initial guess of Ao = Xq . 
Define J(X, Y) as the negative log-likelihood 

J(X,Y)=tr((X®Y)S n )-/logdet(X) 

-plogdet(Y) (6) 

Although the objective ^ is not jointly convex in (X, Y), it is biconvex. This motivates the flip-flop 
algorithm H, @. Adapting the notation from Q, define the mappings A(-),B(-): 

1 1 — 
A(B) = - J2 [B-%iSn(l,k), (7) 



B(A) = -J2 [A-^ijSnUJ), (8) 

^xT p ij=l 

where S n = K^S n K p j (see Sec. II for definition of K p j). For fixed B G <S+ + , A(B) in (jvj) is the 
minimizer of J(A _1 , B _1 ) over A G A similar interpretation holds for (|8j). The flip-flop algorithm 
starts with some arbitrary p.d. matrix Aj n ^ and computes B using ([8]), then A using ([7]), and repeats 
until convergence. This algorithm does not account for sparsity. 

If ©o = Xo <g> Yo is a sparse matrix, which implies that at least one of Xo or Yo is sparse, one can 
penalize the outputs of the flip-flop algorithm and minimize 

J A (X, Y) = J(X, Y) + Ax|X|i + Ay|Y|i. (9) 

This leads to an algorithm that we call KGlasso (see Algorithm [T]), which sparsifies the Kronecker factors 
in proportion to the parameters Xx , Ay > 0. 

The Glasso mapping ^ is written as G(-, A) : S d — > S d , 

G(T,A) = arg min |tr(0T) - logdetf©) + A|0|i). (10) 

®<=s d , I ' J 
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Algorithm 1 Kronecker Graphical Lasso (KGlasso) 



l: Input: S n , p, f, n, Xx > 0, Ay > 

2: Output: & KG i 

asso 

3: Initialize Ai n u to be positive definite satisfying Assumption [T] 



4: X <- 

5: repeat 

6: B <- i Ef )i= i [XkA(i, i) (see Eq. §) 
7: Y «— G(B, ^c), where G(-, •) is defined in ( 10) 
8 : A <- } E(/=i [Y] fc fc) (see Eq. (g» 
9: X^G(A, ^) 
10: until convergence 

11: ©^G/asso ^— X (g) Y 



As compared to the 0(p 4 / 4 ) computational complexity of Glasso, KGlasso has a computational com- 
plexity of only 0(p A + / 4 ) 

V. Convergence of KGlasso Iterations 

In this section, we provide an alternative characterization of the KGlasso algorithm and prove conver- 
gence to a local minimum of the objective function. 

A. Block-Coordinate Reformulation of KGlasso 

The KGlasso algorithm can be re-formulated as a block-coordinate optimization of the penalized 
objective function [9] 

Lemma 1. 1) Assume Ax, Ay > and X G S+ + , Y G When one argument of J^(X, Y) is 

fixed, the objective function ([9]) is convex in the other argument. 
2) Assume S n is positive definite. Consider J\(X, Y) in (j9j) w/f/i matrix X G /ucecf. 77je?i, f/ie 
fifwa/ subproblem for minimizing J\(X, Y) over Y is: 

max logdet(W) (11) 

|W-iEf >i = 1 X«.AC3,i)|oo<Ay 
2 In the sparse Kronecker factor case, this cost can be reduced to 0(p 3 + / 3 ). 
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where Ay := Xy/p- 



On the other hand, consider A9y with matrix Y G SL, fixed. Then, the dual problem for minimizing 



Ja(X, Y) over X is: 



max_ logdet(Z) (12) 

|Z-7Ej.i=iY,.A(M0U<A* 



where S n := K^S n K p j a«c? X x := Ax//- 



3) Strong duality holds for {11 ) and (172 



4) 77ie solutions to (11) and [12) are positive definite. 



Proof: See Appendix. 



Note that both dual subproblems (11) and ( 12 1 have a unique solution and the maximum is attained in 
each one. This follows from the fact that in each case we are maximizing a strictly concave function over 
a closed convex set. Lemma jlj is similar to the result obtained in 0, but with Ylij=i Xi.jS„(j, i), Xy) 
playing the role of (S ra , A), for the "fixed X" subproblem. 

B. Limit Point Characterization of KGlasso 

We will first show that KGlasso converges to a fixed point. Let J\(X, Y) be as defined in (|9]) and 
define jf ) = J A (X (fe) , Y^) far ft = 0, 1,2, ... . 

(k) (oo) 

Theorem 1. If S n is positive definite, KGlasso converges to a fixed point. Also, we have J\ \, J\ . 

Proof: See Appendix. ■ 
The following analysis uses Theorem [T] to prove convergence of the KGlasso algorithm to a local 
minimum. To do this, we consider a more general setting. The KGlasso algorithm is a special case of 
Algorithm [2] Assuming a fc-fold Kronecker product structure for the covariance matrix, the optimization 
problem ([9]) can be written in the form: 

k 

J A (Xi, . . . , X fc ) = J (Xi, . . . ,X fc ) + Ji(Xi) + V/i(Xi) (13) 

i=l 

where X; e fi$ + , ^(X*) := |X m |i, J (Xi, . . . ,X fc ) := tr((Xi ® X a ® ■ ■ ■ ® X fc )S„) and Jj(Xi) = 
~ iliVi • lo § det(Xi) for i = 1, . . . , k. 



Without loss of generality, by reshaping matrices into appropriate vectors, (13 1 can be rewritten as: 

k 

J A (xi, . . . ,x fe ) = Jo(xi, . . . ,x fc ) + 2j «/i(xj) + Xirji(xi) (14) 

8=1 
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where the optimization variable is x := [x^xj, . . . , x^] T G M d , where Xj G and d' = Yli=i 
For example, ^(Xj) = |Xj|i = ||vec(Xj)|| 1 = Hx^^ = ?7i(xj). The mapping {Jj}f =0 can be similarly 
written in terms of the vectors x, instead of the matrices X,-. 



The reader can verify that the objective function ( 13 1 satisfies the properties (for n > pf) in Appendix 

D. 

The general optimization problem of interest here is: 

min Ja(x) subject to vec _1 (xj) = X; G , , i = l,...,k (15) 

xeR d ' 



The positive definiteness constraints are automatically taken care of by the construction of the algorithm 
(see Lemma [j]4). Let the dimension of the covariance matrix be denoted by d := Yli=i di- We assume 
> d. To solve ( [75] ), a block coordinate-descent penalized algorithm is constructed: 



Algorithm 2 Block Coordinate-Descent Penalized Algorithm 



1: Input: S n , di, n, e > 0, Aj > 
2: Output: 

3: Initialize X^, Xf], . . . , X° matrices as positive definite matrices, e.g., scaled identity. 



4: O <- X? ® X° ® • • • ® X° 



5: m ^— 
6: repeat 

7: prev 

8: X^ ^argmin Al ^ «/A(A 1 ,X™- 1 ,...,X™- 1 ; 

9: X^ <- arg mm A ^ JaW, A 2 , . . . , X™" 1 ) 

10: : 

11: X- <- argmin A ^ J A (Xf , X£\ . . . , A fc ) 

12: «- Xf ® X^ ® • • • <g> X™ 

13: m •<— m + 1 

14: until ||0p re v — 0|| < e 



Remark 1. The positive definiteness constraint at each coordinate descent iteration of Algorithms^and 
^need not be explicit since the objective function J\(-) acts as a logarithmic barrier function. 

Note that Algorithm [TJ is a special case of Algorithm [2] An extension of Theorem [TJ assuming n > d 
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or > — oo, based on induction, can be used to show that the limit points of the sequence of iterates 
(x m ) m > = (xf , . . . , x^) m > are fixed points. 

Remark 2. Note that a necessary condition for x* to minimize J\ is £ d,J\(x*). This is not sufficient 
however. 

We next show that the limit point(s) of (x m ) m >o are nonempty and are local minima. 

Theorem 2. Let (x m ) = ( x™, . . . ; x fc l )m>o be a sequence generated by Algorithm^ Assume n > d. 

1) The algorithm converges to a local minimum. 

2) If x° is not a local minimum, strict descent follows. 

Proof: See Appendix. ■ 
As a consequence of Theorem [2] we have the following corollary. 

Corollary 1. Assuming n > pf, the KGlasso algorithm converges to a local minimizer of the objective 
function Q. 

VI. High Dimensional Consistency of FF 
In this section, we show that the flip-flop (FF) algorithm achieves the optimal (non-sparse) statistical 



Thm. 



4) allows us to establish that the 



convergence rate of Op yy - This result (see Thm. 

proposed KGlasso has significantly improved MSE convergence rate (see Thm. [5]). We make the following 
standard assumption on the spectra of the Kronecker factors. 

Assumption 1. Uniformly Bounded Spectra 

There exist absolute constants k A ,kA,k B ,kB,k A . n . t ,kA init such that: 
la. < k A < A min (A ) < X max (A ) <k A <oo 
lb. < k B < A min (B ) < \ ma x(Bo) <k B <oo 

2. < kiA init < A m j n (Aj n j^) < ^rnaxi-^-init) — kA init < OO 

Let Ylpp(3) := A(B(Aj n j t )) (g) B(A(B(Aj n #))) denote the 3-step (noniterative) version of the flip- 
flop algorithm [5]. More generally, let Ylpp(k) denote the A;-step version of the flip-flop algorithm, and 
denote its inverse as Qpp(k) = (Ylpp(k))~ 1 . 



Theorem 3. Let Ao, Bo, A-init satisfy Assumption^ Let M = max(j?, /, n), M = (m&x{pf 1 I 2 + y/f, fp 
Assume (p/) m 2 ^° g ^ 2 M ^ = o(n) as p, f,n — > oo. Also, assume p, f > 2 and ^W}og K I _ Qif^y 

February 13, 2013 DRAFT 



11 



Then, for k > 3 finite, 

\\@FF(k)-®o\\ F = Op(^^^j (16) 

as p, f, n — > oo. 

Proof: See Appendix. ■ 
The next theorem offers a shorter proof of essentially the same convergence rate ( [To} (up to a logarithm 
factor) under milder sufficient conditions. 

Theorem 4. Let Ao, Bo, and Aj n # satisfy Assumption^and define M = max(p, /, n). Asswme p > f > 2 
and p log M < C"n for some finite constant C" > 0. Finally, assume n > y + 1. J/ien, /or k > 2 finite, 



|e„ ( *)-e,l| r = o„ U M±aigj£ (n) 



<35 n — )• oo. 



Proof: See Appendix. ■ 

Remark 3. T/ie sufficient conditions are symmetric with respect to p and f-i.e. for f > p, the corre- 
sponding conditions would become f log M < C"n for some constant C" > 0, and n > *- + 1. 

For the special case of p = f, the sufficient conditions of Thm. [3]become p 4 (log M) 3 = o(n), while the 
sufficient conditions of Thm. [4] become plogM = 0{n). Thus, to achieve accurate covariance estimation 
for arbitrarily structured Kronecker factors, the minimal sample size needed is n = £l((p 2 + f 2 ) logM). 

The bound ( [T7] ) specifies the rate of reduction of the estimation error for the multi-iteration FF 
algorithm, which includes the three step FF algorithm (k = 3) fl5j as a special case. The error reduction 
decreases as long as p and / do not increase too quickly in n. 

Note that (17) specifies a faster rate than that of the naive sample covariance matrix estimator ([5]). 



Furthemore, since the computational complexity for FF is 0(p 2 + f 2 ) which is less than the 0{p 2 f 2 ) 
complexity of SCM, by exploiting Kronecker structure FF simultaneously achieves improved MSE 
performance and reduced computational complexity. 

VII. High Dimensional Consistency of KGlasso 
In this section, consistency is established for KGlasso as p, /, n — > oo. 
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Fig. 1 . Root mean square error (RMSE) performance for the flip-flop estimator (FF) (left) and for the standard sample covariance 
matrix estimator (SCM) (right). SCM performs very poorly in comparison to FF when the covariance matrix decomposes as a 
Kronecker product. Here, the sample size n is fixed and the dimensions of the Kronecker factors (p, /) vary. Equation \\6\ is 
plotted on the left and Equation {5} on the right. Exploiting structure yields a significant reduction in MSE. The magnitude of 
the colormap reflects the error up to a constant scaling. The colormap in both images is the same, which visually shows the 
lower RMSE of FF as compared to SCM. 



A. MSE convergence rate of KGlasso 

Define ®KGiasso(k) as the output of the kth compression and sparsification step (two of these steps 
constitute a full KGlasso iteration). 

Theorem 5. Let Ao,Bq, Aj n j£ satisfy Assumption |ij Let M = max(p, /, n). Let Ay^ >c PyJ ^^p^ ' an d 
A§> x (J, + jj) fxJW> x (7P + 7/ ) P\/W aspJ,n^oo for all k > 1 and k' > 2. 
Assume sparse Xq and Yq, i.e. sx = 0(p), sy = 0(f). Assume max ( ?, - ) log M = o(n). Then, for 



J> p 

k > 2 finite, we have 



\s KGlasso (k) -e \\ F = o P l ^ + / ^ logM (is) 



as p, f,n — ^ oo. 



Proof: See Appendix. ■ 
Theorem [5] of fers a strict improvement over standard Glasso ifTTl . Q and generalizes Thm. 1 in ifTTll to 
the case of sparse Kronecker product structure. Thm. [^generalizes Thm.[4]to the case of sparse Kronecker 
structure. Comparison between the error expressions ([4]), ( [17] ) and ( p~8] ) show that, by exploiting both 
Kronecker structure and sparsity, KGlasso can attain significantly lower estimation error than standard 
Glasso IfTTl and FF Q. To achieve accurate covariance estimation for the sparse Kronecker product 
model, the minimal sample size needed is n = Q((p + /) logM). 

Although Thm. [5] shows a rate on the inverse covariance matrix, this asymptotic rate can be shown to 
hold for the covariance matrix as well (i.e., the inverse of ® KGlasso)- 
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Let Bi := G(B(A inU ), X?)' 1 , where G is defined in {Tol. Then, e KG i asso (l) = G(A(Bi), X^) 



G(B(Aj n j t ), Xy ) denotes the KGlasso output after the the first two steps of the KGlasso algorithm (or 
one KGlasso iteration). A graphical depiction of the first three steps of KGlasso is shown in Fig. [2] 
Define B x = G(B(A init ), X^)' 1 , where G is given in (El. Then, ® KG iasso(l) = G(A(B 1 ),A?) ® 



G(B(Aj n j t ), Xy ) denotes the KGlasso output after the the first two steps of the KGlasso algorithm (or 
one KGlasso iteration). Although Thm. [5] shows a rate on the inverse covariance matrix, this asymptotic 
rate can be shown to hold for the covariance matrix as well (see proof of Thm. [5] in Appendix). 




Fig. 2. Illustration of first three iterations of KGlasso. The squares around the blue dots represent the 1^ balls controlled by 
the regularization parameter (see dual programs \ \\\ and l|12}). As the regularization parameters tend to zero, the balls shrink 
to the blue points, and KGlasso becomes identical to the FF algorithm. 



Figures [3] and [4] graphically compare the MSE convergence rates of KGlasso, FF and standard Glasso 
as a function of p, f for fixed n. Note that the standard Glasso algorithm would yield an inferior rate to 



£8) (recall Q). 

The minimal sample size required to achieve accurate covariance estimation is graphically depicted in 
Fig. [5] for the special case p = f. The regions below the lines are the MSE convergence regions-i.e., the 
MSE convergence rate goes to zero as p, n grow together to infinity at a certain growth rate controlled 
by these regions. It is shown that KGlasso allows the dimension p to grow almost linearly in n and still 



achieve accurate covariance estimation (see (18l) and thus, uniformly outperforms FF, Glasso and the 



February 13, 2013 



DRAFT 



14 




1 
I 



Fig. 3. Root mean square error performance for Kronecker graphical lasso estimator (KGlasso) (left) and flip-flop estimator 
(FF) (right). FF performs very poorly in comparison to KGlasso when the covariance matrix decomposes as a Kronecker product 
and both Kronecker factors are sparse. The bound in Equation l | 1 S[ i is plotted on the left and that in Equation |T6j on the right. 
The magnitude of the colormap reflects the error up to a constant scaling. 




Fig. 4. Root mean square error performance for Kronecker graphical lasso estimator (KGlasso) (left) and standard Glasso 
estimator (Glasso) (right). Glasso performs very poorly in comparison to KGlasso when the covariance matrix decomposes as a 
Kronecker product and both Kronecker factors are sparse. The bound in Equation < | 1 8| > is plotted on the left and that in Equation 
{4| on the right. The magnitude of the colormap reflects the error up to a constant scaling. 



naive SCM estimators in the case both Kronecker factors are sparse. 



B. Discussion 

Theorem [5] is established using the large deviation bound in Lemma [5] We provide some intuition 

— 1 1/2 

on this bound below. Assume that Xj n j 4 = Xo, or Aj n # = X inii = Ao- Define W = X ® I p and 
Zf = Wz[, with i.i.d. z t ~ iV(0, Ao ® Bo), t = 1, . . . , n. Then, z t has block-diagonal covariance 

Cov(zj) = I p B . 

When W is applied to the transformed pfxpf sample covariance matrix, S„ := WS n W T , the first step 
of KGlasso produces an iterate y1 1} = G(B, Ay) with B = i Yl=i ( reca11 < 8 »- For suitable 

Ay = \y\ Yn^ converges to Yo with respect to maximal elementwise norm at a rate Op fy^p^Y 
The convergence of Y^ 1 ^ is easily established by applying the Chernoff bound and invoking the jointly 
Gaussian property of the measurements and the block diagonal structure of Cov(z t ). Lemma [5] in the 
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Fig. 5. Graphical depiction of minimal sample size required for KGlasso, FF, Glasso and naive SCM estimators to achieve 
accurate covariance estimation. The region below the lines constitute the MSE convergence regions-i.e. traveling along a path 
(p(n), n) as n — > oo within such regions implies the MSE convergence rate tends to zero (see {5},{4},l[T7]l and l|18|). 

Appendix establishes that this rate holds even if ~Ki n u / Xrj in Assumption [T] In view of the rate of 
convergence of to achieve a reduction in the MSE of Y, either the sample size n or the dimension 
p must increase. Lemma [5] provides a tight bound that makes the dependence of the convergence rate 
explicit in p, f and n. Theorem [5] uses Lemma [5] to show that KGlasso converges to Xq <8> Yq with rate 




with respect to Frobenius norm. 



VIII. Simulation Results 

In this section, we empirically validate the convergence rates established in previous sections using 
Monte Carlo simulation. 

Each iteration of the KGlasso involves solving an l\ penalized covariance estimation problem of 
dimension 100 x 100 (Step 6 and Step 8 of KGlasso specified by Algorithm [TJ. To solve these small 
sparse covariance estimation problems we used the Glasso algorithm of Hsieh et al [20] where the Glasso 
stopping criterion was determined by monitoring when the duality gap falls below a threshold of 10~ 3 . 

To evaluate performance, Monte Carlo simulations were used. Unless otherwise specified, the true 
matrices Xq := Aq 1 and Yq := B^ 1 were unstructured randomly generated positive definite matrices 
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based on an Erdos-Renyi graph model. First, a square binary matrix C was generated based on indepen- 
dently and identically distributing "Os" with a probability p* and "Is" with a probability 1 — p*. Then, 
C := (C + C T )/2 symmetrizes the matrix. The perturbation level p was selected as p = 0.05 — A m j n (C), 
producing Yo := C + pi/, the sparse inverse matrix. There was a total of 20 trial runs for each fixed 
number of samples n. Performance assessment was based on normalized Frobenius norm error in the 
covariance and precision matrix estimates. The normalized error was calculated using 



1 * ' m L 

— y 



X " »E -S(i)|& 



\ Nmc [ISollJ 

where Nmc is the number of Monte Carlo runs and is the covariance output from the ith trial run. 
The same formula can be adapted to calculate the normalized error in the precision matrix &q. In the 
implementation of KGlasso, the regularization parameters were chosen as follows. The initialization was 
Xj„j 4 = I p . The regularization parameters were selected as Ay^ = c y ^ 1 ^^-, = c x ^J^^- + \y\ 
Ay^ = A^, A^ = A^, etc. For Examples 1 and 2 below, the (positive) scaling constants (c x ,c y ) in 
front of the regularization parameters were chosen experimentally to optimize respective performances. 
For Example 3, we simply set c x = c y = 0.4. 

A. Example 1 

We consider the simple case that Xo and Yo are sparse matrices of dimensions p = 20 and / = 10. 
Figure [6] shows that Xo (8> Yo is a perturbation of I p f. Figures [7] and [8] compare the root-mean squared 
error (RMSE) performance in precision and covariance matrices as a function of n. As expected, KGlasso 
outperforms both naive Glasso and FF over the range of n for both the covariance and the inverse 
covariance estimation problem. As expected, the FF algorithm suffers in the small sample regime. KGlasso 
outperforms FF in this regime since it exploits sparsity in addition to Kronecker structure. 

B. Example 2 

We consider the case when Aq is identity and Yq is dense (see Fig. [9]). Figures 10 and 11 show similar 



trends to those exhibited in Figures [7] and [8] for the case that both Xo and Yo are sparse. 
C. Example 3 

We considered the setting where Xo and Yo are large sparse matrices of dimension p = f = 100 (see 
Fig. [12]). Only 5% of the off-diagonal entries were nonzero for both matrices Xq and Yq. The dimension 
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Fig. 6. Doubly sparse Kronecker matrix representation for simulation example 1. Left panel: left Kronecker factor. Middle 
panel: right Kronecker factor. Right panel: Kronecker product inverse covariance matrix 




10 1 10 2 10 3 



Sample size (n) 



Fig. 7. Normalized RMSE of precision matrix estimate = £ 1 as a function of sample size n for structure exhibited in 
Fig. [6] KGlasso (Kronecker graphical lasso) uniformly outperforms FF (flip-flop) algorithm and standard Glasso algorithm for 
all n. Here, p = 20 and / = 10. 





10 

Sample size (n) 



Fig. 8. Normalized RMSE of covariance matrix estimate £ as a function of sample size n for structure exhibited in Fig. [6] 
KGlasso (Kronecker graphical lasso) uniformly outperforms FF (flip-flop) algorithm and standard Glasso algorithm for all n. 
Here, p = 20 and / = 10. 
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Fig. 9. Sparse Kronecker matrix representation for simulation example 2. Left panel: left Kronecker factor. Middle panel: right 
Kronecker factor. Right panel: Kronecker product inverse covariance matrix. 




Sample size (n) 



Fig. 10. Normalized RMSE performance for precision matrix as a function of sample size n. KGlasso (Kronecker graphical 
lasso) uniformly outperforms FF (flip-flop) algorithm and standard Glasso algorithm for all n. Here, p = 20 and / = 10. 




Fig. 11. Normalized RMSE performance for covariance matrix as a function of sample size n. KGlasso (Kronecker graphical 
lasso) uniformly outperforms FF (flip-flop) algorithm and standard Glasso algorithm for all n. Here, p = 20 and / = 10. 
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of ©o is d = 10,000, which was too large for implementation of standard Glasso. Figures 13 and 14 



compare the root-mean squared error (RMSE) performance in precision and covariance matrices as a 
function of n. As expected, KGlasso outperforms both naive Glasso and FF over the range of n for both 
the covariance and the inverse covariance estimation problem. As expected, the FF algorithm suffers in 
the small sample regime. KGlasso outperforms FF in this regime since it exploits sparsity in addition to 
Kronecker structure. 

For n = 10, there is a 69% 5.09 dB) RMSE reduction for the precision matrix and 35% RMSE 
reduction for the covariance matrix when using KGlasso instead of FF. For n = 100, there is a 41% 
2.29 dB) RMSE reduction for the precision matrix and 26% RMSE reduction for the covariance 
matrix. For the small sample regime, there is approximately a 5.09 dB reduction for the precision matrix, 
which is a significant performance gain. 



D. Example 4 

Here, the true covariance matrix factors Xo = Ag 1 and Yo = Bq 1 were unstructured randomly 
generated positive definite matrices. First, p random nonzero elements were placed on the diagonal of a 
square pxp matrix C. Then, on average p nonzero elements were placed on the off-diagonal and symmetry 
was imposed. On average, a total of 3p elements were nonzero. The resulting matrix C was regularized 
to produce the sparse positive definite inverse covariance Yo = C + plf, where p = 0.5 — A TO j n (C). 

We also compare KGlasso to a natural extension of the FF algorithm that accounts for both sparsity 
and Kronecker structure. The flip-flop thresholding method (FF/Thres) that we consider consists of first 
computing the FF solution and then thresholding each estimated precision matrix. To ensure a fair 
comparison we set the threshold level of FF/Thres that yields exactly the same sparsity factor as the 
KGLasso estimated precision matrices. 

For n = 10, there is a 72% 5.53 dB) RMSE reduction for the precision matrix and 41% RMSE 
reduction for the covariance matrix when using KGlasso instead of FF. For n = 10, there is a 70% 
(w 5.23 dB) RMSE reduction for the precision matrix and 62% RMSE reduction for the covariance 
matrix when using KGlasso instead of FF/Thres. For n = 100, there is a 53% 3.28 dB) RMSE 
reduction for the precision matrix and 33% RMSE reduction for the covariance matrix when using 
KGLasso instead of FF. For n = 100, there is a 50% (« 3.01 dB) RMSE reduction for the precision 
matrix and 41% RMSE reduction for the covariance matrix when using KGLasso instead of FF/Thres. 
For the small sample regime, there is approximately a 5.53 dB reduction for the precision matrix, which 
is a significant performance gain. 
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Fig. 12. Sparse Kronecker matrix representation. Left panel: left Kronecker factor. Right panel: right Kronecker factor. 



Max number of iter. - 20, Trials - 50, (p,f)-(100,100) 




Fig. 13. Normalized RMSE performance for precision matrix as a function of sample size n. KGlasso (Kronecker graphical 
lasso) uniformly outperforms FF (flip-flop) algorithm for all n. Here, p — 100 and / = 100. For n = 10, there is a 69% RMSE 
reduction. 



Max number of iter. - 20, Trials - 50, (p,f)-(100,100) 
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Fig. 14. Normalized RMSE performance for covariance matrix as a function of sample size n. KGlasso (Kronecker graphical 
lasso) uniformly outperforms FF (flip-flop) algorithm for all n. Here, p = 100 and / = 100. For n = 10, there is a 35% RMSE 
reduction. 
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Fig. 15. Sparse Kronecker matrix representation. Left panel: left Kronecker factor. Right panel: right Kronecker factor. As the 
Kronecker-product covariance matrix is of dimension 10, 000 x 10, 000 standard Glasso is not practically implementable for 
this example. 



Max number of iter. = 100, Trials = 40, (p,f)=(1 00,100) 
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Fig. 16. Normalized RMSE performance for precision matrix as a function of sample size n. KGlasso (Kronecker graphical 

lasso) uniformly outperforms FF (flip-flop) algorithm and FF/Thres (flip-flop thresholding) for all n. Here, p = f = 100 and 

Nmc = 40. The error bars are centered around the mean with ± one standard deviation. For n = 10, there is a 72% RMSE 

reduction from the FF to KGLasso solution and a 70% RMSE reduction from the FF/Thres to KGLasso. 

Max number of iter. = 100, Trials = 40, (p,f)=(100,1 00) 
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Fig. 17. Normalized RMSE performance for covariance matrix as a function of sample size n. KGlasso (Kronecker graphical 
Ja^o^^njfoijrjlx outperforms FF (flip-flop) algorithm for all n. Here, p = f = 100 and Nmc = 40. The error bars are cfjge^p^ 
around the mean with ± one standard deviation. For n — 10, there is a 41% RMSE reduction from the FF to KGLasso solution 
and a 62% RMSE reduction from the FF/Thres to KGLasso. 




22 



We finally remark that the benefit obtained in the reduced convergence rate is not only due to the 
covariance estimation method chosen, but to the problem it addresses as well-i.e. the assumed true 
covariance structure. 



E. Empirical Rate Comparison 

Next, we illustrate the rates obtained in for the dimension setting p(n) = f(n) = \8n a ~\, where 
a G {0.1,0.2,0.3}. According to the theory developed, for large n, the MSE converges to zero at a 
certain convergence rate. The predicted rates of FF and KGlasso are fitted on top of the empirical MSE 



curves by ensuring intersection at n = 1000. Fig. 18 shows that the empirical rates match the predicted 
rates well. 



KGlasso & FF 



O FF: Emp. a=0.1 
•* — FF: Pred ct=0.1 
■■* '■■■FF: Emp tx=0.2 
•B — FF: Pred. a=0.2 
• + ---FF: Emp. a=0.3 
•♦•"FF: Pred. a, = 0.3 
—® — KGL: Emp. a = 0.1 
- — KGL: Pred. a = 0.1 
-" — KGL: Emp. a = 0.2 
-B— KGL: Pred. a = 0.2 
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KGL: Pred. a = 0.3 
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Fig. 18. Precision Matrix MSE convergence as a function of sample size n for FF and KGlasso. The dimensions of the 
Kronecker factor matrices grow as a function of n as: p(n) = f(n) — [8 ■ n a ] . The true Kronecker factors were set to identity 
(so their inverses are fully sparse). The predicted MSE curves according to Thm.[4]and Thm.[5]are also shown. For both KGlasso 
and FF, the predicted MSE matches the empirical MSE well, thus verifying the rate expressions J17[ > and i | 1 Sfr . 



We also show a borderline case p = f = [n 6 ] . In this case, according to Thm. [4] and Thm. g the FF 
diverges (MSE increases in n), while the KGlasso converges (MSE decreases in n). This is illustrated in 



Fig. 19 Our predicted rates are plotted on top of the empirical curves. 
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Fig. 19. Precision Matrix MSE as a function of sample size n for FF and KGlasso. The dimensions of the Kronecker factor 
matrices grow as a function of n as: p(n) = f(n) = [n ' 6 ] . The true Kronecker factors were set to identity (so their inverses 
are fully sparse). The predicted MSE curves according to Thm. [4] and Thm. [5] are also shown. As predicted by our theory, and 
by the predicted convergent regions of (n,p) for FF and KGlasso in Fig. [5] the MSE of the FF diverges while the MSE of the 
KGlasso converges as n increases. 



IX. Conclusion 

We established high dimensional consistency for Kronecker Glasso algorithms that use iterative l\- 
penalized likelihood optimization that exploit both Kronecker structure and sparsity of the covariance. 
A tight MSE convergence rate was derived for KGlasso, showing significantly better MSE performance 
than standard Glasso ifTTl . I3l and FF 0. Simulations validated our theoretical predictions. 

As expected, the proposed KGlasso algorithm outperforms other algorithms (Glasso, FF) that do not 
exploit all prior knowledge about the covariance matrix, i.e., sparsity and Kronecker product structure, 
that KGlasso exploits. The theory and experiments in this paper establish that this performance gain is 
substantial, more so as the variable dimension increases. Furthermore, as compared to a simple thresholded 
FF algorithm, which does account for both sparsity and Kronecker structure, KGlasso has significantly 
better estimation performance. 
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Appendix A 
Proof of LemmaQ] 

Proof: 

1) Let 6 € (0, 1). Let Xi,X2 G Then, by the properties of the Kronecker product and trace: 

tr(((0X 1 + (l-0)X 2 )®Y)S„) 

= 0tr((Xi ® Y)S n ) + (1 - 0)tr((X 2 Y)S n ) 

The function ff(Xi) := — logdet(Xi) is a convex function in Xi over the set S^_ + |[T9l . By the 
triangle inequality: 

|0Xi + (1 - 0)X 2 |i < fl|Xi|i + (1 - ^(Xad 

Finally, the sum of convex functions is convex. The set S+ + is a convex set for any p G N. The 
other half of the argument follows by symmetry. 



2) By symmetry we only need prove that ( 12 1 is the dual of min Ye s/ </x(X, Y). By standard duality 



relations between l\ and norms |[T9ll and symmetry of Y: 

|Yh = max tr(YU) 

UeS/:|UU<l 

The maximum is attained at Ujj = | Y ' ,:, | ^ or *>i / ^ an( * at = ^ ^ or = ^ - Using this 
in (|9]) and invoking the saddlepoint inequality: 

min tr((X <8> Y)S n ) -plogdet(Y) +pAy|Y|i 



Ye5 / 



++ 



min max |tr((X ® Y)S n ) — plogdet(Y) 



y~es f ++ |uu<a* 



+ 



ptr(YU)} 



> max min jtr((X (g) Y)S n ) - plogdet(Y) 



+ ptr(YU)} (19) 



When the equality in (19) is achieved, (U, Y) is a saddlepoint and the duality gap is zero. Rewrite 



the objective function, denoted J\(-, •), in the minimax operation (19i: 

J A (X, Y) := tr((X ® Y)(S n + U(X))) - plogdet(Y) 
where U(X) = P?0j. Define M = S„ + U(X). To evaluate min Y6S / Ja(X, Y) in JliJ, 



-l 



we invoke the KKT conditions to obtain the solution Y = ( ^ Yji j=i XyM(j, i)) ) . Define 
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W = Y as the dual space variable. Using this in ( 19 1: 



max {p log det ( W) + pf} (20) 

|W-i£L=i X *.S § nO-,i)U<Ar 

pU 



where the constraint set was obtained in terms of W by observing that U(X)(j, i) = ^ql(j = i). 



and /(•) is the indicator function. It is evident that (20 1 is equivalent to (11 



3) It suffices to verify that the duality induced by the saddle point formulation is equivalent to 
Lagrangian duality (see Section 5.4 in |[T9l ). Slater's constraint qualification (see Section 5.3.2 
in |[T9l ) trivially holds for the convex problem min Yg 5/ Ja(X,Y) and and the corresponding 
convex problem min Y6S ./ j\(X,Y). Since the objective function of each dual problem has an 
optimal objective that is bounded below, Slater's constraint qualification also implies that the dual 
optimal solution is attained. 

4) From lH, it follows that if S n is p.d., each "compression step" (see lines 6 and 8 in Algorithm [l} 
yields a p.d. matrix. Combining this with the positive definiteness of the Glasso estimator (3), we 
conclude that the first subiteration of KGlasso yields a p.d. matrix. A simple induction, combined 



with the fact that the Kronecker product of p.d. matrices is p.d., establishes that ( 11 ) and ( 12 1 are 
p.d. 



Appendix B 
Proof of TheoremQ] 

Proof: Recall that the basic optimization problem ([3]) is 

min Jx(X,Y) 

xe,s* + ,Yes£ + 

Let J* := inf XgS p YeS f ^a(X, Y) be the optimal primal value. Note that > — oo when S„ £ S^ + . 
Now, consider the first step in Algorithm [l] Fix X = x( fc_1 ) and optimize over Y £ Invoking 
Lemma |TJ we have Y( fc ) = argmin YeS / ./^(X^ -1 ), Y). Note, by induction Y( fc ) remains positive 
definite, assuming S n is positive definite and X^ ^ is positive definite. Considering the second step in 
Algorithm |TJ we fix Y = Y^ fe ) and obtain X^ = argmin X eS^ + <7\(X, Y( fc )), so that 

J A (XW,Y< fc )) < J^X^- 1 )^)) < .^(X^-^Y**- 1 )) (21) 

By induction on the number of iterations of the penalized flip-flop algorithm, we conclude that the 
iterates yield a nonincreasing sequence of objective functions. Since Xx |X|i, Ay | Y|i > 0, we see that 
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the maximum likelihood objective function provides a lower bound to the optimal primal value 

JxO^KGlasso, YxGlasso) > J\ > J\(^ML,^Ml) > ~ OO (22) 

Thus, the sequence {j[ k) :k>0} forms a nonincreasing sequence bounded below (since for n > pf, the 
log-likelihood function is bounded above by the log-likelihood evaluated at the sample mean and sample 
covariance matrix). The monotone convergence theorem for sequences 1121! implies that {j[ k) } converges 

(oo) (k) 

monotonically to = inf^ . By the alternating minimization, we conclude that the sequence of 
iterates {(X^, Y^)}^ converges since the minimizer at each Glasso step is unique. ■ 

Appendix C 

SUBDIFFERENTIAL CALCULUS REVIEW 

As sparse Kronecker Glasso involves non-smooth objective functions, we review a few definitions and 
facts from subdifferential calculus [22 J. 

Definition 1. By J-attentive convergence denoted as, x n -4- x, we mean that: x n — > x with J(x n ) — > J(x) 
as n — y oo. 

The role of J-attentive convergence is to make sure that subgradients at a point x reflect no more than 
the local geometry of epi(J) around (x, J(x)). 

Definition 2. Consider a proper lower semicontinuous (LSC) function g : ~R d — > M. U {+oo}. Let x be 
such that J(x) < oo. 
For v € R d , 

a) v is a regular subgradient of J at x (i.e., v G 9J(x)j if lim infx^x^^x ^^~njx^Sl ^ x ~ x ^ > 0. 

b) v is a general subgradient of J at x (i.e., v G <9J(x)j if there exists subsequences x 11 4 x and 
v n 6 <9J(x n ) such that v n -> v. 

Let x be such that J(x) < oo. It can be shown that <9J(x) = limsup xJ ^_5J(x), 5J(x) C <9J(x) and 
both sets are closed. 

Define the set of critical points Cj := {x : G <9J(x)} = C Jjmin U C JjSad die U Cj jma:c , where Cj jmiri 
contains all the local minima, Cj^ sa ddie contains all the saddle points and Cj jmax contains all the local 
maxima. 

Definition 3. Let A C ]R n . Define the distance from a point x 6 R n to the set A as d(x, ^4) := 
infagA ||x - a|| 2 . 
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Appendix D 
Properties of objective function J a 

The following set of properties will be used in Lemmas [2j [3] and Theorem [2] 

Property 1. 1. Jo : M. d — > R is continuously differentiable (i.e., fo £ C 1 ) 

2. V Jo : W 1 — > M. d is uniformly continuous on bounded subsets B C M. d 

3. Ji : M. di ->iU {+00} is proper^and lower semicontinuous (LSC), for i = 1, . . . , k 

4. r\i : ~R d — > M + is uniformly continuous and bounded on bounded subsets B C M. d , for i = 1 , 

5. J A is bounded below-i.e. J^ > — 00 

6. J\ is strictly convex in at least one block (for all the rest of the blocks held fixed) 
where J^ = inf x eS d t J A (Xi, . . . , X^) is the optimal primal value. 

Appendix E 
Lemma|2] 

Lemma 2. Given the notation established in Definition [2] and J\ given by we have: 

<9J A (xi, . . . ,x fc ) = xf =1 {V Xi Jo(xi, . . . ,x fc ) + <9Ji(xj) 
+ Ai%(xi)} 
= x J A L 1 {9 Xi J A (xi,...,x A; )} 
where d Xi J\(xi, . . . ,x&) is the partial differential operator while all {xj : j 7^ i] are held fixed. 

Proof: First note that we have: 

k 

<9J A (xi, ... ,Xfe) = VJ (xi, ... ,Xfe) + d{^2 Ji{*i) 

i=i 

k 

i=i 

k k 

= VJ (xi, . . . ,x fc ) + d{^2 Ji(*i)} + d(%2*iVi{Xi)} 

i=l i=l 

= VJ (xi,...,Xfc) + x* =1 {0Ji(x i )} + xf =1 {Ai%(xj)} 
= x* =1 {V Xt Jo(xi, . . . ,x fc ) + aji(xj) + Aj%(xj)} 



3 A function J : X -> R U {±00} is proper if dom(J) = ji £ X : J (a;) < 00} / and J(x) > -00, Vt G X. 
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where (24) follows from Property [T] and Exercise 8.8(c) in 11221 . (25 1 follows from Corollary 10.9 in 



221 . d26b follows from Proposition 10.5 and Equation 10(6) p.438 in [22 ] since Aj > 0, and finally (p7]l 



follows from Minkowski sum properties. 

Appendix F 
Lemma [3] 

Lemma 3. Let m denote the iteration index. For m G N, define: 

l x l J •- Vx^olXi ,x 2 ...,x fc J 

- V Xl Jol^i ,X 2 . . . ,X fc ) 

(x 2 ) .- V X2 J (x 1 ,x 2 . . . ,x fc ) 

m „m „m— 1 „m— 1 N 



-V X2 J (xr,x^,x 



3 



( x -)° :=V Xj J (xr,x 2 rt ...,xn 

-V x ,Jo(xr,...,x-x-- 1 1 ...,x-- 1 ) 

(x?)° := 

77j<?«, ((x™)°, . . . , (x™)°) G 9Ja(x™, . . . ,x™). A/so, /or a// convergent subsequences (x mj )j of the 
sequence (x m ) m , we Ziave 

d(0, d Ja(x™ j , . . . , x™ 3 )) ->• as j -)■ oo 

Proof: From Algorithm |2j we have: 

x^ GargminJA^x™- 1 ,...^™- 1 ) 
x," G argminJA^x^x-- 1 ,...^- 1 ) 

x 2 



x^ G argminJ A (xY l ,...,x^_ 1 ,x fe ) 

The first subiteration step of the algorithm implies that G <9 Xl Ja(x™, x^ - , . . . , x™ -1 ), the second 
subiteration step implies G <9 X2 Ja(x™, x™, x™ _1 , . . . , x™ _1 ), etc. Rewriting these using Lemma kl we 
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have: 

o g v Xl Jo^x™- 1 , . . . ,x™- 1 ) + ajx(xr) + \idm(*T) 

O G Vx.JoCxr^r.x-- 1 , • • .,xr ! ) + a/ 2 (x^) + A 2 r/ 2 (x^) 

o g Vx^x^xg 1 , . . . ,xn + a/ fc (xn + A fe a%(x^) 

This implies that for i = 1, . . . , k: 

(xD° G V^JbCxr,^, ■ • ■ ,*D + ^(xD + Ai%(xT) 

It is important to note that <9r?j(x) 7^ 0, Vx G M d \ for i = 1, . . . , k, as a result of property [l]4. To see 
why, apply Corollary 8.10 in E2l since r)i is finite and locally LSC at every point in its domain. This in 
turn implies ((x^) , . . . , (xj™)°) G <9Ja(x^, ... ,x™) by Lemma [5] 

Now, take an arbitrary convergent subsequence (x™ J , . . . , x™ 3 )j of (x™, . . . , x^) m . The convergence of 
(x x , . . . , x fc )j implies the convergence of (x x , x 2 , . . . , x fc jj, and (x x , . . . , x^ , x i+ \ , . . . , x fe 
for i = 2, . . . , fc— 1. Taking j — )• oo and using properties [I] 2, we see that lim^oo d(0, dJ\(x.™ J , . . . , x™ 3 )) = 
since lim^oo ((x™ 3 )°, . . . , (x™ 3 )°) = (0, . . . , 0). 

■ 

Appendix G 
Proof of Theorem[2] 

Proof: 

1) Let L(x°) = L(x.®, . . . , x°) be the set of all limit points of (x m ) m >o starting from x°. The block- 
coordinate descent algorithm, Algorithm [2j implies 

JoCx^x™- 1 , . . . , x-- 1 ) + Jl(x?) + Mrii(x?) 
< Max,*.™- 1 , . . • ,x™" 1 ) + Ji(ai) + Ai^iK) 

for any a\ G M d K Now, assume there exists a subsequence (x mj )j of (x m ) m that converges to x*, 
where x* is a limit point. This implies that (x™ 3 , x™ 3-1 , . . . , x™ 3_1 ) — > x* as j — > oo. The above 
inequality combined with properties [T] 1 and[T]4 (i.e. the continuity Jo and rn) then implies that 

limsupJi(x™ 3 ) + J (xi, ...,x* k )< Ji(ai) 

j'-KX) 

+ J (ai,X2,. . . ,x]t) + \i(vi(ai) - mi^-i)) 
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for all a\ G M d *. Taking ot\ = xf then yields limsup^^ Ji(x™ 3 ) < Ji(xJ). Using the lower semi- 
continuity property of Ji (property[I]3), we have liminfj^oo Ji(x™ J ) > Ji(x^). Thus, linij_ s>00 Ji(x™ 3 ) = Ji(x\). 
By a similar line of reasoning, it can be shown that Jj(x™" J ) — > </i(x*) as j — > oo, for i = 
1, . . . , k. As a result, X)i=i ^( x i™ J ) — ^ Ei=i <M x i) as i ~~ ^ 00 ■ Since Jo(-) is jointly continuous, 
J (xfV . . ,x^) J (x*, . . ,4). By continuity of Vi (-), ^ti A^( X D -> £ti A^( x *)- 
Thus, Ja(x^) J A (x*) as j 00. 

Now, Lemma [3] implies that ((x mj )°) G dJ\(x ml ). Since the subsequence (x mj )j is convergent, 
by Lemma [3] we have (x mj )° — > as j — > 00. As a result, since dJ\(x m: >) is closed (see Theorem 
8.6 in E21) for all j, we conclude that x* G Cj. Thus, L(x°) C Cj. 
We have thus proved that limit points are critical points of the objective function. 
We can rule out convergence to local maxima thanks to property [T]6. Let us show this rigorously. 
Assume there exists a local maximum at x' = (x' l5 . . . ,x' fc ). Then, there exists r > such that 
•A( x ) < ^a( x ') f° r a ll x suc h that ||x — x'|| 2 < r. Fix Xj = x'^ for all i 7^ 1. Without loss of 
generality, assume J\ is strictly convex in the first block. Since strict convexity is maintained through 
linear transformation, without loss of generality, assume d\ = 1. Let e < r. Define = x\ — e 
and X2, e = x'i + e- Define xg = 9x\^ e + (1 — 9)x2, e , where 9 G (0, 1). Since Ufa^x^i] — x'|| 2 = 
\xg — x' x \ = e(l — 26) < r, by the local maximum definition, there exists e G (0, r) small enough 
such that 

9J\(x 1>e , x^) + (1 - 9) J\(x 2 , e , x^) < Jx{x e ^ x ) 

for some 9 G (0, 1). Since e > 0, we have x\^ 7^ X2, e , and this contradicts strict convexity. Thus, 
there are no local maxima. 

Next, we use the non-existence of local maxima and continuity of J\ to rule out convergence to 
saddle points. Assume there exists a saddlepoint at x s . Then, by definition, G Ja( x s) an d x s is 
not a local maximum or a local minimum. Since x s is not a local minimum, for all e > 0, there 
exists a point x' such that ||x' — x s || 2 < e and Ja(x s ) > Ja(x'). By continuity, it follows that there 
exists 5 > such that for all x satisfying ||x — x'|| 2 < 5, we have Ja(x s ) > Ja(x), which implies 
that x s is a local maximum. This is a contradiction and thus, x s is a local minimum. So, no saddle 
points exist. 

Theorem [T] implies that £(x°) is nonempty and singleton. 

4 An alternative way to get a contradiction is to assume there exists a strict local maximum and use only convexity, instead 
of strict convexity. 
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2) We show that if we do not start at a local minimum, strict descent follows. Let //(•) denote the 
point-to-point mapping during one iteration step, i.e., x m+1 = ^(x m ). We show that if x° ^ Cj, 
then L(yp) C Cj >m i n . The result then follows by using the proof of the first partj^] To this end, 
let x be a fixed point under fi, i.e., /x(x') = x . Then, the subiteration steps of the algorithm 
yield G 3 Xi Ja(x 1; . . . , x fc ) for i = l,...,k, which implies G dJ\(x ), i.e., x G Cj. The 
contrapositive implies that if x ^ Cj, then Jx(/z(x)) < J\(x) (strict descent). A simple induction 
on the number of iterations then concludes the proof. 



Appendix H 
Lemma|4] 

The following technical lemma will be used in the proof of Lemma [5] 

Lemma 4. Let z ~ N(0, Aq ® Bo), where Ao G 5?.+) Bo G <$++• 77?en, /or m > 0, we have the moment 
bound: 

r ' p 



E 



m+2- 



^ ([z]( i _ 1)/+A .[z]( i _ 1)/+i - [Ao]ij[Bo]fe,j) 

m+2 



< (2m + 2)!!p max [B ]fe,fc||X|| 2 || A || 2 

\i<fc</ 

Remark 4. f/ie symmetric X G 5 P case, f/ze bound in Lemma |4] can #e tightened to 

m+2- 

E X] Xi J ([ z ](i-i)/+fc[ z ](i-i)/+« - [Aokj[B ]fc,« 

< (2m + 2)!!( max [B ] fc , fc ) m+2 ^((XA ) m+2 ) 

i<fc</ 

Proof: 

Consider the index set {{h,ji}, {i2,h}, ■ ■ ■ , {im+2, jm+2}}- Define groups G k = {i k ,jk} for k = 
1, . . . , m + 2. Let the generic notation ir(-) denote the permutation operator of a set of indices. 

Define the set of indices M m +2 = M m +2(ii, ji, ■ ■ ■ ,im+2, jm+2) as the set containing sequences 
(ii, Ji, . . . , I m+2 , Jm+2) satisfying the properties: 

1) {h,J\, ■ ■ .,I m +2,J m +2} is a permutation of the index set {h,ji, . . . ,i m +2, jm+2} 
-i.e. Ji, . . . ,I m +2, Jm+2} = 7r({ii,ji, . . . , i m +2, jm+2}) 



5 The first part of the proof showed Cj = Cj,, 



February 13, 2013 



DRAFT 



32 



2) For each q £ {1, . . . , m + 2}, indices I q and Jg must belong to disjoint groups {Gk}™=i 

3) Suppose a sequence {ii, Ji, . . . , I m +2i Jm+2} satisfies the first two properties. Then, add it to 
M m+ 2 and M m+2 does not contain (block-permuted) sequences of the form 

MMih, Ji}), vr({/ 2 , J 2 }), . . . , 7T({/ m+2 , J m+2 })})} 
It can be shown that card(M m+2 ) = (2m + 2)!!. 
As an illustrative example, consider the case m = 1. 

Example 1. For m = 1, f/ie sef M m+2 contains the following 4!! = 8 elements: 

{{h,h}, {h,h}}, {{h, «2>, {ii, J3}, {j'2, ^3}}, 

{{n, J2>, 0'l> »3>, {«2, J3}}, {{n,i2}, {ji, J3/, {>2, «3/}, 

{{n,^}, {ii, i2>, {i2,i3>}, {{h, «3>, {ii, J2}, {«*2,i3}}, 
{{hj3}AhJ2}A i 2,h}},{{h,j3}Ah,h},{h,h}}- 
Of course, other equivalent possibilities for M m+2 are possible. 

Note that tr((XA ) m+2 ) > for all m > 0. From Isserlis' formula H3, we have: 

m+2- 



E 



Xjj ([z](i_ 1 )/+jfe[z]y_ 1 ) /+i - [A ]i,j[B ]fc,/) 
p p 

«l,jl = l im+2,im+2 = l 

ro+2 

x e [ n ([ z ](<--i)/+fcH(j«.-i)/+» - [ A o]* Q j Q [Bo]fc,« 



a=l 



< ( max [B ]fc,fc 
i<fc</ 



p p 

il,il=l im+2,im+2=l 

m+2 



{^,^};= + i 2 e A ' / -+2 <? =1 
< ( max [B ] fc , fc r +2 (2m + 2)!!p(||X|| 2 ||A || 2 r +2 
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Appendix I 
Lemma[5] 

The following lemma will be used in the proof of Theorem [3] Theorem |4] and Theorem [5] The method 
of proof is by moment generating functions. A similar bound can be obtained under the same set of 
assumptions using standard decoupling arguments and Gaussian chaos Talagrand-based bounds. 

Lemma 5. Let X be a pxp data-independent matrix. Define the linear operator T as T(X) = B(X _1 ), 
where B(-) is defined in ([§]). Assume max/jBo^fc, ||X|| 2 , ||Ao|| 2 ore uniformly bounded constants as 
p,f -)■ oo. Define B* := ^^^B . Let c, r > 0. Define ^(u) = E"=o { ^r^u m f\ Let C := 
" J+rl '"' ; " u -' :) < ]n!r(m Z f ^ Then, with probability 1 



- log(max(/,n)) ™» y.vwtu, x max(/ n)c 



T(X) -B.U < W#(-i- )m ax(2 !c)l /Mmax^n)) 



v 2 + r V 
w/iere k = ma,Xk[&o]k,k ■ ||X|| 2 || Arj|| 2 . 

Remark 5. Choosing c < 2 in Lemma [5] f/ze oesf relative constant is obtained by taking r to infinity, 



which yields J 40( 2x7) max(2, c) — > 4. 



Remark 6. For f/ie case of symmetric matrices X £ S p , the constant k can be improved to max/%[Bo]fc,fc • 
l|XA || 2 . 

Proof: This proof is based on a large-deviation theory argument. Fix (k, I) 6 {1, . . . , /} 2 . Note that 
E[T(X)] = B*. First we bound the upper tail probability on the difference T(X) — B* and then we turn 

6 The double factorial notation is defined as 



ml'. = < 



m ■ (m — 2) 3-1 if m > is odd 

m ■ (m — 2) 4-2 if m > is even 

1 if m — — 1 or m = 



7 If p — f — n c for some c' > 0, this condition will hold for n large enough. 
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to the lower tail probability. Bounding the upper tail by using Markov's inequality, we have 
Pr([T(X)] M -[B*] M >e) 



Pr (it^ASnti,^- 1 ^ 



n p 



Pr ( ^2 ^2 x ^ i 

m=l i,j=l 



Bo]/c,/ > e 

z ™J(i-i)/+tW(j-i)/+i 



- [A ]ij[B ]fe^ ) > npe^ 

n p 

Pr(exp{£^ X i,i([ Z m](i-l)/+fc[ z m](i-l)/+i 
m=l i,j=l 

- [A ]ij[B ] k A} > exp{tnpe} 



m=l 

< e~ tnpt ( E 



n p 

< e -tnpe E ^ "Q exp | t ^([z^-Df+k^j. 

}] 



(28) 



where we used the i.i.d. property of the data in (28) and y( fc '0 := Ylij=i ~X-i,j{[ z ](i-X)f+k[ z ](j-i)f+l ~ 
[Ao]i,j[Bo] fc ,/)- Define p 2 xl random vector z( fc '') as [z( fc ' Z )] (i _ 1)p+i := [z} {i _ 1)f+k [z} {j ^ 1)f+l -[A }ij [B }k,i 
for 1 < z, j < p. Clearly, this random vector is zero mean. The expectation term inside the parenthe- 



ses in (28 1 is the MGF of the random variable y( fc '') = vec(X) T z( fc '^. For notational simplicity, let 
0y(i) = E[e ty ] denote the MGF of a random vector Y. As a result, E[e* Y<M) ] = c/> y(fc>!) (t). 
Performing a second order Taylor expansion on (py^.i) about the origin, we obtain: 

<?>yCM)W = <Py<M)(0) + 



-i + 



dt 2 



for some 5 G [0,1]. Trivially, y(M) (O) = 1 and d ^ (k d f (0) = E[vec(X) T z( fc '')] = 0. Using the linearity 



of the expectation operator, we have: 

dT 2 



E[(yW))V^ <M> ] 
(sty 



= y ^ E [(vec(X) T z( fc '')) m+2 ] 
z — ' ml 

m=0 

Using the elementary inequality 1 + y < e y for y > — 1, and after some algebra, we have: 

oo 

nln(^ y(M) (t))< ^t 2 YT m (t) 



(29) 



m=0 
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where T m (t) := E[(vec(X) T z^)) m + 2 ]. Note that 

^m+2 - . P 



t 2 T m (t) < -^j-E^ 53 x ij([ z ](i-i)/+fe[ z ](i-i)/+i 



t 



\ m+2 
Ao]tj[Bo]fc,j[) J 

m+2 P ; ' 



/??! 



ilj'l=l « m + 2jn, + 2 = l 

m+2 

X E [ II ([ Z ](i„-l)/+fc[ Z ](i„-l)/+« - [ A oL,j«[ B o]/M 



a=l 
j.m+2 

< -(2m + 2)\\p( max [Bo]jt ,fc||X|| 2 || A || 2 

nr. ■ " ' ' 



,m+2 



(2m + 2)!!_ m+2 



to! 



i<fc<f 
(tk) m+2 p 



(30) 



where (30l follows from Lemma |4||j Also, we defined k = maxi<k<f[Bo]k,k • ||X|| 2 || Ao|| 2 - Summing 
the result over m, and letting u := tk > 0, a m (u) := ^|pw m , tp(u) := J2m=o a m(u), we obtain: 



t 2 53 T m (t) < pu 2 ^(t 



m=0 



u=tk 



(31) 



By the ratio test [21], the infinite series Ylm=o a ™( u ) conver g es if it < 1/2. To see this, note 

a m+ i(u) 
p:= Inn rT - 

m->oo a m (ltj 

(2m + 4)!! to! 
— lim ti 

m->oc (2m + 2)!!(m + l)! 

1 + 2/m 
= lim 2u ; — 

m->oo 1 + l/m 

= 2u < 1 



Using (31 1 in (29), and the result in (28 1, we obtain the exponential bound: 

Pr([T(X)] fc ,, - [B*] M > e) 

r np(tk) 2 —. I 
< exp | — tnpe H ^(rfc) j- 

Let i < 1 - and e < 2 rpf V^2+~f)^ < 00 ■ By tne nionotonicity of ip(-), we have: 

^ 2 — 2 

Pr([T(X)] fe> , - [B*]^ > e) < exp { - tnpe + V>(^— ^)} 



(32) 



In the symmetric X case, this bound can be tightened using tr((XA ) m+2 ) < p(||XA || 



\m+2 
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Optimizing (32 1 over t, we obtain t* = _ 2 £ ; Clearly, t* < 1 - . Plugging this into (32i, we obtain: 

k ip(j^:) " (2+T)k 



Pr([T(X)] fcj , - [B,]*,, > e) < exp { - — } 

Define C := _ 2 1 , — . Since ibii^—) < oo, C > 0. Thus, for all e < ttI— ^(th- we have 

P([T(X)] fc>1 - [B*] M > e) < e"^ 2c (33) 

where C > is independent of n,p, f. 
Next, we bound the lower tail: 

Pr([T(X)] fc ,,-E[[T(X)] M ]<-£) 

n p 

= Pr ( E - X i,j([ Z rn}(j-l)f+k[Zm](i-l)f+l 

m=l i,j=l 



- [Ao]i,j[Bo]jfc,i) > npe^ 

where 4>y^,i) is the MGF of Y^ k < l \ Performing a second order Taylor expansion as before, we have: 

0?(».o(-*J = 0y-c*.o(°) Til * + o Jjw * 



m=0 



where 2^(t) := ^pE[(< V ec(X), z( fe ><) >) m + 2 ] = (-l) m T m (t) < T m (t) and 5 G [0,1]. Proceeding 
similarly as above, it can be shown that for all e < ^P^i^pf)^'- 

Pr([T(X)] M - E[[T(X)] M ] < -e) < e~ n ^ c (34) 

where C was defined as before. From (33) and (34), we conclude that for all e < ^hf^iz^)^- 



Pr(|[T(X)] M -E[[T(X)] M ]|>e) 

< Pr([T(X)] fc)l - E[[T(X)] fcjJ ] > e) 

+ Pr([T(X)] w -E[[T(X)] fcj ,] < -e) 

< 2e~ npe2c 
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The union bound over (k, I) £ {1, ... , f} 2 completes the proof. Let us rewrite this. If 4 max(2,c) iog(max(/,n))(2+r) ^ 
np, then with probability 1 — max h n \c 



iT ( x) - E[ T(x)]u < k . ^^i ^mm^m 



. j ., it 1 n to \ /logmax(/,ra) 

^ fc - 2 A/^ 77— ) max 2 > C )W 

2 + r y np 



Appendix J 
Proposition Q] 

Proposition 1. Let S p j n be a d! x d! (where d! = p or d! = f) random matrix such that with probability 
1—^5, |S p j jri — S*|oo < T p j iTl . Assume I]* G S+ + has uniformly bounded spectrum as p, f — > oo (analog 
to Assumption 1 ). Choose X p j n = c ■ r p j^ n for some absolute constant c > 0. Consider the Glasso 



operator G(-, •) defined in {10). Let s = s®^ be the sparsity parameter associated with 0* := S^, , 
Assume \J d! + s ■ r p j :7l = o(l). Then, with probability 1 %, 



|G(S Pi/ , n , A Pi/jn ) - ®4 F < 2 2 ^J" ^ y/d' + s ■ r pJyf , 



as n — ? oo. 



Proof: The proof follows from a slight modification of Thm. 1 in [17], or Thm. 3 in |[T6l . This 
modification is due to the different r p / «■ ■ 

Appendix K 
Proof of Theorem[3] 

Proof: As in the proof of Thm. 1 in Q, let B* = tr ( A "^ t ) Bo and A ^ _ ( tr ( A °^»») )-1 Aq- Note 
that Assumption [I] implies that ||B*|| 2 = ©(1) and || A* || 2 = ©(1) as p, f — > oo. 

For concreteness, we first present the result for k = 3 iterations. Then, we generalize the analysis to 
k > 3 flip-flop iterations. 

Define intermediate error matrices: 

B° = ~B(Ai n u) — B* 
A 1 = A{B(A imt )) - A, 
B 2 = B(A(B(A init ))) - B* 
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Lemma |5| implies (B ^ = Op ( \/ l °n^f ) as n ,P,f — > °°> so we have 



|B°|L = Op 



if 2 logM 
np 



(35) 



From the matrix inversion lemma [24], we have 

B(A^)- 1 = (B. + B )- 1 



B- 1 - B-WB- 1 + A] 



(36) 



where 



Let us show 



k=2 



Ax = ^(— B~ 1 B°) fe B~ 1 
7 2 - 5 logM 



|Ai|| F = P 



np 



(37) 



From the submultiplicative property of the spectral norm, the triangle inequality and (35 1, we have: 

oo 

HAill^ll^-B^BYB- 1 !!, 

k=2 

oo 

^(IIb^iijbViib- 1 ^ 

k=2 

oo 

^SdlB^llallB ^)*!^- 1 ^ 



k=2 



Op 



where we used Assumption 1 and b p j jn = ||B !f i || 2 ||B u || F = Op[ \/ r np M ) m tne bound: 



/ 2 log/ 
np 

111 IIWO 



5 3 , 



fc=2 " Jp 'f< n 

with high probability as p, f, n — > oo (here, C > is an absolute constant). This type of asymptotic 
argument will be used in the remainder of the proof to retain the dominant terms. The inequality || Ai[|_p < 



\/7||Ai|| 2 then implies (37 1. 
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Next we expand vec(A 1 ) - 



1 - 



vec(A 1 ) = jR A vec(B(A init ) x ) -vec(A*) 



= iR A vec(B; 1 ) - jK A (B: T ® B; 1 )vec(B u ) 
+ vec(A 2 ) 

= vec(A(B*) - A*) ^ -vec(A*) 

+ vec(A 2 ) 

where we used (see (106) from @) 

K A (B: T ® B; 1 ) = vec(A,)vec(B,- T ) T 
R B (A; T ® A; 1 ) = vec(B,)vec(A,- T ) T 



(38) 



(39) 



and A 2 is given by 



vec(A 2 ) = iR A vec(Ai) 



R^B^B^ve^B ) 



El) and 



Using (37 1 and (35) and bounds ||Ra|| 2 < Pf\& 



n Z-'0 1 00 



Oplpf 



pt A || 2 =np^o|| F ||B || F < Vp7||E || 2 



log(p/) 



(40) 

(see Lemma 1 in 



|A 2 || F < illR^ll^lAxll^ + yllRAllzllB^II^B ^ 



Op ( (/V 1/2 + y/p) 



logM 



7? 



(41) 



Now, using the triangle inequality in (38 1 and the bounds (41 1 and (35 1, we obtain: 



I A 1 < ||A(B*)- A4 F + 



\A-*\\f\\^* \\f\\ I 

7 



(42) 



where we also used the Cauchy-Schwarz inequality ^(A^B^ 1 )| < || Asll^HB* 1 \\ F < \/pf\\ A* || 2 1| 1 || 2 

O(l). This implies that 



and Lemma 



The growth assumption implies 



P ,j, n _ f 2 p~ 1/2 +yp I log m 
pf- 1/2 +VJ V « 



%fn i s asymptotically negligible and thus: 



\A 1 \\ F = P [(pf 



-1/2 



/) 



logM 



n 



(43) 
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In parallel to (36 1, we expand: 

(A(B(Aj„ it )))' _1 = (A.+A 1 )- 1 

= a; 1 - a; 1 a 1 a; 1 + A 3 

where 

oo 

A 3 = ^(-A~ 1 A 1 ) fe A~ 1 
Then, expanding B 2 in parallel to (38), 



k=2 



vec(B 2 ) = -R B vec((A(B(A mit )))- 1 ) - vec(B*) 
P 

= vec(B(A») - B,) - ^vec(B*) 



P 



+ vec(A 4 ) 



where 



vec(A 4 ) = -R B vec(A 3 ) - -R B (A; T ® A^ 1 )vec(A 1 ; 
p p 



Using similar techniques as above, it follows from (|43j» and ( |45) > 

ll A 3|| 2 

which implies 



Op((p/-" 2 + y7) 2 ^) 
a 3 || f = op (yvw- 1/2 + Vff^) 



Applying the triangle inequality in (47 1 and using d43l, (48 1, we have after some al 



|A 4 |[p < -||R B || 2 ||A 3 || F + ^IIRbII^IA; 1 !! 2 !!! 1 ^ 



op( v /7(pr l / 2 + y/) 2 ^) 



From (46 1, (43 1 and (47 1, we obtain: 



2,i no, a , D n , M A, ' Al)l || Bt || F ~ ||A 



|B 2 || F < ||B(A») -B„|| F + 

HA 1 ! 

< ||B(A*) -B4 F + 
( 

= o P 



P 



4\\f 



P 



A * 1 |l2ll B *ll2 + II^HiT 



(fp- 112 + vW ^ + /f(p/- 1/2 + v7) 2 '° gM 



6(2) 



6(2) 



Op|(/p-" 2 + vp). /logM 



n 



February 13, 2013 



41 



where we used the growth assumption to obtain = ^/p-i/^+^P 



0(1). 



Define the errors Rpi?(3) = Rff(3) — So an d R := S n — So- Using the permutation operator 1Z 
defined in [5], we plug in ( |38] l and ( |46| ) and simplify: 

vec (K(R FF (3))) = vec(vec(A 1 )vec(B ! ,) T ) 

+ vec(vec(A*)vec(B 2 ) T ) + vec(vec(A 1 )vec(B 2 ) T ) 

= P K 1 Hvec(R)+vec(A JJ ) (51) 

where Ap is given by: 

A R = vec(A 2 )vec(B*) r + vec(A*)vec(A 4 ) T 



_ tr(A 2 A; 

P 

and S is a data-independent matrix: 



-vec(A )vec(B ) T + vec(A 1 )vec(B 



2\T 



(52) 



H = P R [-(vec(B )vec(B 1 ) T ®I p2 )P /?A 

+ -(I /2 tg> vec(A )vec(A 1 ) T )K / 2 ip2 P /?£ 



1 

Ft' 



vec(B )vec(A 1 ) T 



® vec(A )vec(B 1 ) T )K p 2 J2 p Ra ] 
From (51 1 and (24) in [5], we have: 

vec(R FjF (3)) = Evec(R) + vec(A R ) 



(53) 



(54) 



where vec(Ap) := P^vec(A^). Since Pp is a permutation matrix, ||Ar|| f = ||Ap|| F . Using this and 
the triangle inequality in ( [52] ): 

\\&r\\f ^ II A 2|IfV7||B*|| 2 + \/pII A *II2II A 4|If 

+ ||A 2 || F ||A- 1 || 2 ||Ao|| 2 x/7||Bo|| 2 + || A 1 1| F || B 2 1| F 
r log AT 



Op Vp/^ 



n 



(55) 



where we used the Frobenius norm bounds in (|41j), (|49j), (|42j) and <[50|). Recall that M was defined in 
the statement of the Theorem. 



From ( |54| ), we obtain: 

Cov(vec(R FF (3))) = HCov(vec(R))3 T + E[Hvec(R)vec(A K ) T ] 
+ E[vec(A i? )(Hvec(R)) T ] + E[vec(A R )vec(A fl ) T ] 



(56) 
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where we used E[vec(R)] = and E[vec(A^)] = 0. By Jensen's inequality and the Cauchy-Schwarz 
inequality, we obtain: 

|tr(E[Hvec(R)vec(A i? ) T ])| < E| < vec(A R ), 3vec(R) > | 
<E||Evec(R)|| 2 ||vec(A R )|| 2 

< ||S|| 2 E[||R|| F ||Afl|| F ] (57) 



From (53 1, it is easy to see that ||S|| 2 = 0(1) as p, f — > oo. Similarly, we have 



|tr(E[vec(A /? )vec(A i? ) T ])| < E[\\A R f F 



Next, we show 



tr (S(S ® £ )3 T ) = tr(A ) 2t ^Il + tr(B ) 2 ^ 

/ P 



tr(Ag)tr(Bg) 



Pf 



Once (|59|) is established, using tr(Ao) 2 < ||Ao|| 2 p 2 and tr(Bg) < /||Bo|| 2 , we have: 



tr S(S (8)Eo)H J < (p 2 + f 2 )\\V 



Similarly, we have 



tr (S(E„ ® S )H T ) > (p 2 + / 2 )A mm (A ) 2 A min (B c 



J 0||2 



Let us prove (59 1 now. Note that 

9 

tr(S(S ® S )H T ) = vec(H T H) r vec(S ® E ) = J^Tj 
where T, will be defined and calculated below. 



i=l 



(58) 



(59) 



(60) 



(61) 
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We have 

Ti = vec({i(vec(B )vec(Bo X ) T 8> I p2 )P fiA } T x 

{i(vec(B )vec(B l ) T ® V)P /?A }) T vec(S 8) E„) 
= t ^^vec((vec(B 1 )vec(B X ) T ® I p2 )) T x 

vec(P BA (Eo®So)PL) 
= tf( ^ 2 vec((vec(B 1 )vec(B X ) T ® I p2 )) T x 

vec((B ® B ) (8) (A 8) A )) 
= tr(A ) 2 ^^ • vec(B - 1 ) T (B 8) B )vec(B l ) 

= tr(A ) 2 ^^ 

where we used several Kronecker product identities ll24ll and the definition of the permutation matrix 
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X 



X 



The rest of the terms are given by: 

T 2 = vec({i(vec(B )vec(Y ) T I p *)PrJ T 

x {^(I /2 <g) vec(A )vec(Xo) T )K / 2 iP 2P i?B }) T vec(S <8 E ) 

T 3 = vec({i(vec(B )vec(Y ) T <8 V)P/? A } T 

{-^(vec(B )vec(X ) T vec(A )vec(Y ) T )K p 2 i/2 P RA }) T 

vec(S (8) S ) 
T 4 = vec({^(I /2 vec(Ao)vec(X ) T )K / 2 ip2 P ijB } T 

x {i(vec(B )vec(Y ) T ® I p 2)P RA }) T vec(S ® S ) 

T 5 = vec({-(I/2 <8 vec(Ao)vec(Xo) T )K / 2 ip2 P i?B } T 

x {-(I /2 ^ vec(A )vec(A 1 ))K / 2 iP 2P i?B }) T vec(S S ) 

= tr(B ) 2tr( ^ 

P 

T 6 = vec({-(I /a <8 vec(Ao)vec(X ) T )K / 2 ip2 P i?B } T 

x {-^(vec(B )vec(X ) T vec(A )vec(Y ) T )K p 2 i/ 2P^}) T 
x vec(S <8 S ) 

T 7 = vec({-^(vec(B )vec(X ) T ® vec(Ao)vec(Yo) T )K p 2 5/ 2P RA } T 

x {i(vec(B )vec(Y ) T I p 2)P^}) T vec(S S ) 
T 8 = vec({-^(vec(B )vec(X ) T ® vec(A )vec(Yo) T )K p 2 5/ 2P RA } T 

x {^(I /2 ^ vec(A )vec(X )' r )K / 2 iP 2P i?i3 }) T vec(So S ) 
T 9 = vec({-^(vec(B )vec(X ) T ® vec(Ao)vec(Yo) T )K p 2 5/ 2P RA } T 
{^(vec(B )vec(X ) T (8 vec(A )vec(Y ) T )K p 2 i/ 2P^}) T 
vec(S (8 S ) 

where we used the definitions of the permutation matrices Pr b , Pr a , K p 2 j2 and Kp )P 2. 

After some algebra, we observe T 2 = -T 3 = T 4 = -T 6 = -T 7 = -T 8 = T 9 = 1 <MM^-, which 
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implies: 



tr(S(E ® E )H T ) =T 1 + T 5 + T 7 



which in turn yields (59 1. 

From ( |56| ) and the bounds ([57]),(|58]), we have: 

ntr(Cov(vec(R FF (3)))) 



< 



tr(H(S^®S )H^) 

n(2||S|| 2 E[||R|| i ,||A ii || i; ,]+E[||A B |||]) 



Of 



tr(S(Eg , ®S )S r ) 

U ((p/) 3/2 M(^) 3/2 
I n 



p 2 + r- 

+ p/M 2 (^) 2 
n 



}) 



( { P f?' 2 M (logM) 3 / 2 \ 
V P 2 + f 2 V™ J 

o(l) 



(62) 
(63) 



where we used (61 ) in (62). Note that we used the growth bound assumption A/l ^| — = O(^fn) in (63 1 



|^J and the last step follows from the growth assumption 
Thus, in first order, we have: 

tr(Cov(vec(R FF (3)))) 

as p, /, n — > oo, which implies: 



(p/) 3 M 2 (logM) 3 

( p 2 +/ 2)2 



tr(S(So ® S )H 3 



77 



o(n). 



\Kff(3)\\ 2 f = Op 



tr(S(S ® E )H T ) 



as p, /, 77 — >• oo. Combining this with (60 1, we conclude that the proof for = 3 FF iterations is complete. 

Next, we turn to the k > 3 case, where k is odd. For this case, the main arguments will be sketched since 
the details follow from similar analysis as before. We saw above that A.r is asymptotically negligible as 
p, /, n — > oo, so we introduce the approximation symbol "=" that will retain only the first order terms. 

As before, define B° = B(Aj n j t ) — B* and for k > 3 odd, recursively define: 



vec(A fc - 2 ) - iR A vec(B; 1 ) - ^Ra(B; t 



B 4 T i )vec(B fc - 3 ) 



vecfB*- 1 ) S 1 R B vec(A; 1 ) - -R#(A~ T <g> A~^)vec(A. k ~ 2 ) 
P P 



(64) 
(65) 



This follows since 



MS < SWi^M = 0(y/n\. 
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This is a continuation of the recursive pattern that arises when ignoring the higher-order asymptotics-i.e., 



for k = 3, (64 1 becomes identical to (38) when A2 is negligible and (65) becomes identical to (46) when 
A4 is negligible. 

Expanding out vec(A ) and vec (B^ 1 ) for odd k > 3, we obtain: 

1 



vec(A fc - 2 ) - 
,fc-3 s 1 



R^ve^B; 1 ) 
vec(A,)vec(B; T ) T R B vec(A; 1 ) 



Pf 



+ ( 



fc-3 s 1 



Pf 



vec(A,)vec(A; T ) T R yl vec(B; 1 ) 



P/ 

vec(B fe - 1 ) ^ 



vec(A ! „)vec(B !|t T ) T R B vec(A 



it vit 1 



k 



-(^) 



-R#vec(A 
P 

1, 1 



-In 



P/ 



+ ( 



fc-3 s 1 



P/ 



vec(B,)vec(A,- T ) T R A vec(B; 1 ) 
vec(B,)vec(B; r ) T R i? vec(A; 1 ) 



H -vec(B*)vec(B. 

P/ 



-T\T 



where we repeteadly used the identities (39) (see (106) in 0). 

Then, at the Mi odd FF iteration, letting R FF (k) = R FF (k) — So> we have: 

vec(7e(R FF (£;))) ^ vec(vec(A fc - 2 )vec(B*) T ) 
+ vec(vec(A*)vec(B fe - 1 ) T ) 

where we used Eqs. (22),(23) from ||5]. 



(66) 



(67) 



(68) 



Using (66) and (67) in (68) and simplifying: 

vec(K(R FF {k))) S P^ 1 Svec(R) 
where S was defined in (53 1, R = S„ — Sq> an d Pj? is the permutation matrix defined in (24) from [5]. 



Note that three cancellations occured in the last step. Using (24) from @: 

vec(R FF (k))) = Hvec(R) 



(69) 



Since (69) is identical to (54) when A# is negligible, the same analysis following (54) holds and as a 
result, the same rate holds for odd k > 3. 
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For the k > 3 even case, the analysis changes slightly, but the same rate holds. To see this, first note 
that for k > 4 even, 



vec(K(R FF {k))) vec(vec(A fc - 1 )vec(B,) T ) 
+ vec(vec(A,)vec(B fe - 2 ) T ) 



(70) 



Using (66 1 and (67i in (70 1, we obtain, after some algebra, 



vec(7e(R FF (A:))) -(vec(B )vec(Bo X ) T ® I p2 )vec(R A ) 
+ ^(1/2 ® vec(A )vec(A 1 ) T )vec(R^) 
- ^(vec(B )vec(B 1 ) T 8) vec(A )vec(A 1 ) T )vec(R^) 
= P^ 1 3vec(R) 



where 



H = P 



/ 



(vec(Bo)vec(Yo) T ®I p2 )P^ 



1 



+ ~(I /2 ® vec(A )vec(Xo) i )Kf2^>P RE 



2 

P/ 



(vec(B )vec(Y ) T ® vec(A )vec(X ) T )K / 2 iP 2P i?f 



Thus, vec(Ri?i?(fc)) = Hvec(R). Proceeding similarly as before, we have: 



tr(S(S ® S )H T ) = vec(H J H) T vec(S 



» ^0) 



8=1 
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where 



Ti = vec({i(vec(B )vec(Y ) T 1 p *)Pr a } T 

x {i(vec(B )vec(Y ) T ® V)P i?A }) T vec(S ® E ) 
f 2 = vec({i(vec(B )vec(Y ) T I^)Pfl A } T 

x {-(I /a 0vec(A o )vec(X o ) T )K / 2 ip2 P RB }) T 

x vec(E <8> ^o) 
T 3 = vec({i(vec(B )vec(Y ) T V)P flx } T 

x {^(vec(B )vec(Y ) T vec(A )vec(X ) T )K /2ip2 P RB }) T 
PJ 

x vec(E <S> S ) 
f 4 = vec({^(I /2 ^ vec(A )vec(X ) T )K / 2 iP 2P i?B } T 

x {i(vec(B )vec(Y ) T V)P i?A }) T vec(S E ) 
T 5 = vec({^(I /2 vec(A )vec(X ) T )K / 2 ip2 P i?B } T x 

{-(I /2 0vec(A o )vec(X o ) T )K /2ip2 P i?B }) T vec(So «S ) 
T 6 = vec({-(I /2 ® vec(A )vec(X ) T )K /2ip2 P i?B } T 

x {-^(vec(B )vec(Y ) T vec(A )vec(X ) T )K /2ip2 P RB }) T 

x vec(£ (8> S ) 

T 7 = vec({-^(vec(B )vec(Y ) T ® vec(A )vec(X ) T )K /2jp2 P i?B } T 

x {i(vec(B )vec(Y ) T <g> I p2 )P i?A }) T vec(S E ) 
T 8 = vec({-^(vec(B )vec(Y ) T ® vec(A )vec(X ) T )K /2jp2 P i?B } T 

x {^(1/2 ® vec(A )vec(X ) T )K /2ip2 P i?B })' r vec(So S ) 
f 9 = vec({-^(vec(B )vec(Y ) T (8 vec(Ao)vec(X ) T )K /2jp2 P i?B } T 

x {-^(vec(Bo)vec(Yo) T ^vec(Ao)vec(X ) T )K /2ip2 P RB }) T 

x vec(S <8> S ) 
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After some algebra, we notice f x = Ti, f 2 = T 2 , f 5 = T 5 , % = -2T 2 , f 2 = T 4 , T 9 = 4T 2 and 
T 6 = T 7 = T 8 = %. Thus, Yh=i f i = T 1 + T 5 - 2T 2 . From this, it is easy to see that we have the upper 
and lower bounds: 

tr(S(E ® ^o)H T ) < (p 2 + / 2 )||£ || 2 

tr(S(S ® S )H T ) > {p 2 + f )X min (A ) 2 X min (B ) 2 - 2|| ^ 111 

The remaining of the proof follows similarly as in the k > 3 odd case. The proof of the theorem is 
complete. The same rate holds for the error in the precision matrix (for example, see proof of Thm. [4]). 



Appendix L 
Proof of TheoremH] 

Proof: As in the proof of Thm. 1 in Q, let B* = ^^^B and A* = ( g ^^^ )~ 1 A . Note 
that Assumption 1 implies that HB^ || 2 = 0(1) and ||A*|| 2 = 0(1) as p,f — > oo. For conciseness, the 
statement "with probability 1 — \" will be abbreviated as "w.h.p."-i.e., with high probability. 

For concreteness, we first present the result for k = 2 iterations. Then, we generalize the analysis to 
all finite flip-flop iterations by induction. 

The growth assumptions in the theorem imply 



max < 



P, f, — , 
P 



P + f 



log M < C'n 



(71) 



for some constant C > large enough [^] In fact, the growth assumption in the theorem statement can 



be relaxed to (71) 



Define intermediate error matrices: 

B° = B(Aj n jt) — B* 
A 1 = A(B(A iri j 4 )) — A* 

Define Y* = B" 1 and X* = A^ 1 . Also, define: 

Yi = B(Aj n a) 1 
X 2 = A(B(A imt ))- 1 

10 This constant is independent of p, /, n, but may depend on the constants in Assumption 
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These inverses exist if n > max{^, ^} + 1 (see ESI ). Define the error YlFF(k) = YlpF{k) — for 
k > 2. For notational simplicity, let B™ aa: := max^fBo]^/; and A™ ax := maxj[Ao]j,j, tp T '■= ipi^hr)' 
where ip(-) is defined in Lemma [5] 
Lemma [5] implies that for 



n > 8 ( 2 + r ) logM (72) 



then with probability 1 — A, we have: 



B°|| F < 2V^BriA-> || 2 / (73) 



As in the proof of Thm. 1 in 0, we vectorize the operations ([7]) and ([8]): 

vec(A(B)) = j&AveciB- 1 ) 

vec(B(A)) = -RBvecfA" 1 ) 
P 

where R^ and Rg are permuted versions of the sample covariance matrix ||5]. 



Let e' > 1. Note that from ( [73] ), for 

n > (e^v^BniA-^Aoll^W 1 logM (74) 
with probability 1 — A, from Weyl's inequality, 

A m in(B(Aj n it)) = A m j n (B° + B*) > A m j n (B i)! ) — ||B°|| 2 
> A m m(B*) — ||B ||^ > ^1 — — ~\ \ m i n (B*) 



Thus, w.h.p., 



|Yi — Y*\\p — ||Yi(B(Aj„it) — B*)Y*|| ir 

IIB°II 

/ || V 11 iiv 11 lift II " " F 

— II 1 l|l2ll * II 2 II \\F 



< (l - 7) 1 IIY^II^^B^IIA-^A 



Using Lemma [5} for 



Amin (B* )A m j n (B( A jnji)) 
II 2 



n > 8 ( 2 + r ) bgM (76) 



„ 1/9 /logM 

x/p- 1/2 A/-^— (75) 
n 
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then, w.h.p., 



||Ra|| 2 = sup ||Rav|| 2 <p sup HR^v - Rav| 

l|v|| 2 =l ||v|| 2 = l 

t \\ l T3 < VeC ( B 0),V > 

= pf sup || 7 Rav vec(A )|| c 

l|v|| 2 = l / / 



< 2^Ar 2 ||B || 2 Px/7/^ 



(77) 



Expanding A 1 : 



vec(A 1 ) = yR J 4vec(Yi) - vec(A*) 



tr(B (Yi-Y.)) 
/ 



vec(A ) + vec(A(B*) - A*) 



+ -R A vec(Y! - Y») 
where we used R A = vec(A )vec(B^) T (see Eq. (91) from 0). 



(78) 



Now, using the triangle inequality in ([78]), the bounds (75) and (77 1, the Cauchy-Schwarz inequality, 



we obtain w.h.p. (under conditions (72 1,(74 1,(76 )) ? after some algebra: 

IIA 1 !^ < y^HAolUlBolUlY! - Y»|| F +p|A(B») - A, 

+ i||R A || 2 ||Y 1 -Y*|| jF 



\ogM ~ p-AogM 

V C 2 \/pf 

n n 



(79) 



where 



Ci : = 2v/2Vv max {k(£ ) ^1 - ||@o|| 2 || A 

} 



|2 
init \\2 



xB^IIArlA 



Co 



1 



tr(A A^ t ) 

8^Ar a Bri|Bo|| 2 ||Ar^A o || 2 ||0o||^||Ari t ||^ 



Let ci > 0. For 



n> (i^) 2 plogM 
C x c\ 



then, from (79 1, we have w.h.p. 

wa.% <ci(i + Cl )(yj+pr 1/2 ) 



logAf 
n 



(80) 



(81) 
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Using the permutation operator TZ(-) denned in 0: 

vec (fc(IW(2))) = vec(vec(A 1 )vec(B ;t ) T ) 

+ vec(vec(A*)vec(B°) T ) + vec(vec(A 1 )vec(B°) T ) 



(82) 



From (]73|, (|82j and vec(£p F (2)) = P jR vec(^(S FF (2))) 0, under conditions (|72j,([74j),(|76j) and (|80j), 
w.h.p., 

II^ff(2)|| f < l|Ai|| F ||B*|| F + ||A*|| F ||B || F 



+ ||A 1 || F ||B°|| F 



V n n 



(83) 



where 



C 'j : max | ||B 



*0 M 2 



Ci(l + Cl ) 



2y2^ K (A )||A imi || 2 Br x ||A- 1 4t A || 2 } 
C< 4 := 2 V / 2^Br a; ||Ari t A || 2 C' 1 (l + <*) 



For 



(84) 



then, from (83 1 w.h.p., 



|S FF (2)|| F <(7 3 (l + c 2 )(p + 2/) 



log M 



The proof for = 2 iterations is complete. Using a simple induction, it follows that the rate ( 17 1 holds 
for all k finite. 

Next, we show that the convergence rate in the precision matrix Frobenius error is on the same order 



as the covariance matrix error. Let 0^(2) := T,pp(2) . From (81 1, for 

n > (e / ||X,|| 2 Ci(l + Cl )) 2 (v/7 + pf- 1/2 ) 2 log M 

then w.h.p., 

||X 2 - X4 F < (1 - i)- 1 ||X*||^C , i(l + ci) 



V n 



(85) 
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Using (75 1 and (85 1, we have w.h.p., 

||©ff(2) — &o\\p — 11^2 — X*|| F ||Y*|| F 

+ || Yi — || F ||X*|| p + ||X2 — X* || F || Yi — || F 



V n y p n 



where 



D X = ||0 O || 2 - ^) max{/c(A )||A^ t || a Ci(l + c x ), 

2 v ^||e || 2 [|A init || 2 Br !B ||A^ t Ao[| 2 } 
£>2 = (1 - i)- 2 2 x /2^||0 o ||^Br :r ||A-^Ao|| 2 C' 1 (l + ci) 



For 



n> fe } ( 2 /+ p } l0gM 



the bound (861 becomes w.h.p., 



|0 FF (2) - O || F < Di(l + d')(2/ + p)l ' '° g }f 



n 



Thus, the same rate Op I \l ^ + ^ ' Iog M I holds for the precision matrix Frobenius error. 



Appendix M 
Proof of Theorem[5] 

(k) A <lc) (k) A <fc) 

Proof: For ease of notation, let \ x = and X Y = for all > 1. First, we show that the 
first iteration of the KGL algorithm yields a fast statistical convergence rate of Op ^^/ ( p+ -^ logM ^ by 
appropriately adjusting the regularization parameters. A simple induction finishes the proof. Adopt the 
notation from the proof of Thm. [4] 
Lemma [5] implies that for 

n> v ' logM (87) 

Wt 

then w.h.p., 



|B°|oo < 2 x /2^Bri|A^A Q || 2 ,/^ (88) 
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where B° = B(Aj n jt) — B*. From Proposition 111 and (88 1, we obtain w.h.p., 



|Yi - Y,|| F < 2^2(1 + Cy)y/T+^ \\Y*\ 



x 2 x /2^Br a ||Ari t A || 2 J f -^- (89) 

y n P 

where we also used sy < cy / and Yi := G(B(Ai n it), Ay ) = B^ 1 . Note that /p -1 logM = o(n) 
was used here. 

Let A 1 := A(Bi) - A*. Then, we have 

vec(A 1 ) = yRAvec(Yi) - vec(A*) 

= ^R A vec(Y! - Y*) + iR A vec(B; 1 ) 

= iR A vec(Yi - Y») + vec(A(B*) - A,) + -R A vec(Yi - Y„) 
= tr(Bo(Y ^~ Y * )) vec(Ao) + vec(A(B*) - A,) 

+ -R A vec(Yi - Y*) (90) 
where we used Ha = vec(Ao)vec(B^) T (see Eq. (91) in 10). Recall the definition of a mixed norm 

||W|U= sup ||Wv|L (91) 

l|v|| Q =i 



From (89 1 and (91 1, we have w.h.p., 

illRWYx-Y^IL^^-ir 
<8^Ar x Br x ||Bo|| 2 ||A-i t A || 2 



2^2(1 + cy)y/l + c Yo \\Y4 2 ^-^^- (92) 

\/P n 



where we used the bound on the mixed-norm-i.e., for n > 8 ^ r - ) logM, then w.h.p. (from Lemma 

f = 7 sup URavHoo 

J J ||v|| 2 =l 

1- <vec(B ),v> 

= sup || 7 Rav vec(A )|| oo 

llv|L=i J J 



5), 



<2 V / 2^Ar a: ||Bo|| 2 J^ 
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From ([90]), applying the triangle inequality and using the Cauchy-Schwarz inequality, (89 1, (92 1, w.h.p., 



lA 1 !^ < l tr (B (Yi Y *))I | Aq | oo + |A(B*) -A*\ 00 + illR^vecCYi - Y*) 



< 



VTHBolUYx-Y 



* II F 



f 



A |oo + |A(B*) - A* | 



1 



+ y||R A vec(Y 1 - Y, 



< 5l >4 Wf.ft'W 

\Vp VfJ V n y/p n 

where we used Lemma [5] The constants C\ and C2 are given by: 

G\ = max|K(So)||0o|| 2 l|Aj n it||22\/2(l + c y )y/l + c Yo 

xQv^BgHIA^Aolla, 



2 V / 2^A 



in ax 




P 



t r (A Aj tJ 



} 



C 2 = 2 v /2VvA^||Bo|| 2 • 2^2(1 + cj,) vT + ^T||Y. 



II 2 



tt(A A-^ 



2v^B^||A Ari 



init 1 1 2 



For 



the bound in ([93]) simplifies to 



C 2 2 logM 



Cic'- niv> 



A 1 !^ <Ci(l + C ') 



+ 



logM 



n 



From Proposition [T] and (95 1, we obtain w.h.p.: 

||Xi - X,|| F < 2^2(1 + Px)Vl + cxol|X,||^Ci(l + c') 



x 1 + 



p\ /logM 



(93) 



(94) 



(95) 



/y v n 

where we used s Xa < c Xa p and Xi := G(A(B X ), A^), X* := A" 1 . Note that (1 + y/p/f) 2 logM 
o(n) was used here. 



(96) 
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Finally, using (89 1 and (96 1, we obtain w.h.p.: 

\\&kgl(2) -&q\\ f = ||Xi® Yi -X* ®Y*|| F 
< [|Yi - Y*|| FV ^||X*|| 2 + ||Xi - X»|| F v7l|Y.|| 2 
+ ||Yi — Y*|| F ||Xi — X*|| F 

'/.logM 



<C' 3 (2V/ + Vp)a/— + 4 (1 + 



n 



p n 



(97) 



■ M 2 



where the constants are given by 

C 3 = 2^||0o|| 2 max{(l + Cj / )yrT^||Y 

(i + caoyr+^Aa + c ')/<Ao)iiAry 2 } 

C4 = 8||© |||(1 + Cy)y/1+Cy (l + c x .)^l + c Xo Ci(l + d 



tr(A A 



_1 ) 

in it > 



For 



(98) 



the bound (97i further becomes: 



|0x GL (2) - e || < 3 (1 + c")(2\// + Vp) 



logM 



Note that ||0KGi(2) - © Q ||p = O p < (p+/+vW)logM 



Op 



n 

(p+f) logM 



as p,f,n — >■ oo. This 



concludes the first part of the proof. The rest of the proof follows by similar bounding arguments coupled 
with induction. The rate remains the same as the number of iterations increases, but the constant on front 
changes. 
For 



logM 



the bound (97i further becomes: 



\Gkgl(2) - &o\\ F < C 3 (l + c")(2V/ + VP) 



logM 



(p+f+Vpf) iggM 



Op 



(p+/) logM 



as p, f,n —> oo. This 



Note that \\&kgl(2) - ®o\\ F = O p 
concludes the first part of the proof. The rest of the proof follows by similar bounding arguments coupled 
with induction. The rate remains the same as the number of iterations increases, but the constant on front 
changes. 
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Next, we show that the convergence rate in the covariance matrix Frobenius error is on the same order 



as the inverse. From ( |89| ), for 

> ( e / 2V2(l + Cj / )7r+^) 2 ||Y,||^(Yo) 2 /p' 1 logAf 



n 



we have w.h.p. A m j„(Yi) > A m i n (Y*) - ||Yi - Y*|| F > (1 - ^)A m i n (Y*), which in turn implies w.h.p., 



|Bi — B* \\p < ( 1 



/ AogM 



n 



[[j Using a similar argument, from (96 1, for 

(e'2V2(l + c^y/l+^xodil + c')) 2 



n > 



A • (A ) 2 



k(X ) 2 



x (l + .fe 2 logM 



we have w.h.p., 



|Ax - A,|| F < (\ - ~j 2^2(1 + c^^l + cxAi 1 + <0 



xk(X ) 2 (i+ /; ' 



logM 



where Ai = X 1 . 

Let Skgl(2) := 0^ GL (2)- 1 = A a <g> B^ Then, w.h.p., 



\^kgl(2) — ^o\\f - [l-A-i ~~ A* I |B* 1 1 F 

+ || Bi — B*|| F || A*|| F + || Ai — A*||ji||Bi — B*|| F 

fclogAf 



< jDl (2V/ + VP)\/— + ^2(1 + 



11 



p n 



where the constants are: 



£>x = (1 - ij-^VSmax ((1 + c x )Vl +^XoCx{l + c') 



X K 



(x ) 2 ,(i + c y )vTT^(Y ) : 



d 2 = 8(1 - -r 2 a + c^yrr^u + c^yrr^ 

x k (X ) 2 k(Y ) 2 Ci(1 + c') 



"Here, Bi = Y x 1 exists since Yi is positive definite (see 1 10 ) 



(99) 



(100) 



(101) 
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For 

i-l .n - 



n>( ^ )2 ( vt+^' logM 



then < | 1 1 p implies w.h.p., 



\T, KGL (2) - SqIIf < £>i(l + d)(2\// + V^)1 R ' / 



Thus, the same rate Op ( \/ ( p+ -^ log — ) holds for the error in the covariance matrix. 



References 

[1] G. I. Allen and R. Tibshirani, "Transposable regularized covariance models with an application to missing data imputation," 

The Annals of Applied Statistics, vol. 4, no. 2, pp. 764-790, 2010. 
[2] M. Yuan and Y. Lin, "Model selection and estimation in the gaussian graphical model." Biometrika, vol. 94, pp. 19-35, 

2007. 

[3] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for 
multivariate Gaussian or binary data," Journal of Machine Learning Research, vol. 9, pp. 485-516, March 2008. 

[4] P. Dutilleul, "The mle algorithm for the matrix normal distribution," J. Statist. Comput. Siniul, vol. 64, pp. 105-123, 1999. 

[5] K. Werner, M. Jansson, and P. Stoica, "On estimation of covariance matrices with Kronecker product structure," IEEE 
Transactions on Signal Processing, vol. 56, no. 2, February 2008. 

[6] K. Werner and M. Jansson, "Estimation of kronecker structured channel covariances using training data," in Proceedings 
ofEUSIPCO, 2007. 

[7] A. Dawid, "Some matrix-variate distribution theory: notational considerations and a bayesian application," Biometrika, 

vol. 68, pp. 265-274, 1981. 
[8] A. K. Gupta and D. K. Nagar, Matrix Variate Distributions. Chapman Hill, 1999. 
[9] N. Cressie, Statistics for Spatial Data. Wiley, New York, 1993. 
[10] J. Yin and H. Li, "Model selection and estimation in the matrix normal graphical model," Journal of Multivariate Analysis, 

vol. 107, pp. 119-140, 2012. 

[11] K. Yu, J. Lafferty, S. Zhu, and Y. Gong, "Large-scale collaborative prediction using a nonparametric random effects model," 
ICML, pp. 1185-1192, 2009. 

[12] E. Bonilla, K. M. Chai, and C. Williams, "Multi-task gaussian process prediction," Advances in Neural Information 

Processing Systems, pp. 153-160, 2008. 
[13] Y. Zhang and J. Schneider, "Learning multiple tasks with a sparse matrix-normal penalty," Advances in Neural Information 

Processing Systems, vol. 23, pp. 2550-2558, 2010. 
[14] N. Lu and D. Zimmerman, "On likelihood-based inference for a separable covariance matrix," Statistics and Actuarial 

Science Dept., Univ. of Iowa, Iowa City, IA, Tech. Rep., 2004. 
[15] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, 

vol. 9, no. 3, pp. 432-441, 2008. 



February 13, 2013 



DRAFT 



59 



[16] S. Zhou, J. Lafferty, and L. Wasserman, "Time varying undirected graphs," Journal of Machine Learning Research, vol. 80, 
pp. 295-319, 2010. 

[17] A. Rothman, P. Bickel, E. Levina, and J. Zhu, "Sparse permutation invariant covariance estimation," Electronic Journal of 

Statistics, vol. 2, pp. 494-515, 2008. 
[18] P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu, "High-dimensional covariance estimation by minimizing .^-penalized 

log-determinant divergence," Advances in Neural Information Processing Systems, 2008. 
[19] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004. 

[20] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, "Sparse inverse covariance matrix estimation using quadratic 

approximation," Advances in Neural Information Processing Systems, vol. 24, 2011. 
[21] R. G. Bartle and D. R. Sherbert, Introduction to Real Analysis. John Wiley & Sons, 2000. 
[22] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis. Springer, 1998. 

[23] L. Isserlis, "On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number 

of variables," Biometrika, vol. 12, 1918. 
[24] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990. 

[25] N. Lu and D. Zimmerman, "The likelihood ratio test for a separable covariance matrix," Statistics and Probability Letters, 
vol. 73, no. 5, pp. 449-457, May 2005. 



February 13, 2013 



DRAFT 



